# Readme Information Provided

Summary
=======

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follow.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


Content and Use of Files
========================

Formatting and Encoding
-----------------------

The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.


User Ids
--------

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).


Movie Ids
---------

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).


Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


Tags Data File Structure (tags.csv)
-----------------------------------

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)


Links Data File Structure (links.csv)
-------------------------------------

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

Use of the resources listed above is subject to the terms of each provider.


In [1]:
import json
import os
import pandas as pd
from datetime import datetime

In [2]:
os.listdir('data')

['links.csv',
 'tags.csv',
 'ratings.csv',
 'imdb_movies.json',
 'README.txt',
 'movies.csv']

# Ratings

In [3]:
ratings = pd.read_csv('data/ratings.csv')

In [4]:
ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [6]:
ratings.shape

(100836, 4)

In [15]:
# Convert epoch seconds to datetimes. Note that datetime.fromtimestamp returns
# naive datetimes in the machine's local time zone, not UTC.
ratings['timestamp'] = ratings['timestamp'].apply(datetime.fromtimestamp)

In [16]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 14:45:03
1,1,3,4.0,2000-07-30 14:20:47
2,1,6,4.0,2000-07-30 14:37:04
3,1,47,5.0,2000-07-30 15:03:35
4,1,50,5.0,2000-07-30 14:48:51
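The README defines timestamps as seconds since the Unix epoch in UTC, whereas `datetime.fromtimestamp` yields naive datetimes in the local time zone of the machine running the notebook. The following is a minimal sketch of a UTC-aware conversion with `pd.to_datetime`. The frame `sample` and the column name `rated_at` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of ratings.csv.
sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# unit="s" interprets the integers as epoch seconds; utc=True makes the
# result timezone-aware instead of depending on the local machine.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(sample[["timestamp", "rated_at"]])
```

Writing the result to a new column keeps the raw integer `timestamp` available for any later cell that expects epoch seconds.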

In [7]:
ratings.nunique()

userId         610
movieId       9724
rating          10
timestamp    85043
dtype: int64

In [8]:
ratings['rating'].value_counts().sort_index()

0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: rating, dtype: int64
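The ten distinct values above match the half-star scale from 0.5 to 5.0 described in the README. As a sketch, that property could be checked programmatically along these lines. The helper name `on_half_star_grid` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def on_half_star_grid(rating: pd.Series) -> pd.Series:
    """Return True for values that are multiples of 0.5 within [0.5, 5.0]."""
    doubled = rating * 2
    # A valid rating doubles to a whole number; between() enforces the range.
    return (doubled == doubled.round()) & rating.between(0.5, 5.0)


# Hypothetical sample; in the notebook this would be ratings["rating"].
sample_ratings = pd.Series([0.5, 3.0, 4.5, 5.0, 2.0])
print(on_half_star_grid(sample_ratings).all())
```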

# Tags

In [9]:
tags = pd.read_csv('data/tags.csv')

In [10]:
tags.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

In [11]:
tags.head()

,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [12]:
tags.shape

(3683, 4)

In [13]:
tags['tag'].value_counts()

In Netflix queue     131
atmospheric           36
thought-provoking     24
superhero             24
funny                 23
                    ... 
small towns            1
In Your Eyes           1
Lloyd Dobbler          1
weak plot              1
Heroic Bloodshed       1
Name: tag, Length: 1589, dtype: int64
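`value_counts` compares tags as exact strings, so variants that differ only in case or surrounding whitespace would be counted separately. Whether `tags.csv` actually contains such variants is not shown above. The sketch below uses made-up tags to illustrate one possible normalization step before counting.

```python
import pandas as pd

# Made-up tags for illustration; not taken from tags.csv.
tags_sample = pd.DataFrame(
    {"tag": ["funny", "Funny", " funny ", "MMA", "mma", "Boxing story"]}
)

# Trim whitespace and lowercase so spelling variants collapse into one key.
normalized = tags_sample["tag"].str.strip().str.lower()
counts = normalized.value_counts()

print(counts)
```

Lowercasing loses information for tags where case matters (for example, acronyms), so this is best kept in a separate column rather than overwriting `tag`.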

# Movies

In [14]:
movies = pd.read_csv("data/movies.csv")

In [15]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [17]:
movies.shape

(9742, 3)

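In [18]:
# Split the pipe-separated genre string into a list of genres per movie
movies['genres'] = movies['genres'].apply(lambda x: x.split("|"))

In [19]:
# One row per (movie, genre) pair
movies_agg = movies.explode('genres')

In [20]:
# Pivot to one column per genre; a cell is 1 where the movie has that genre and NaN otherwise
movies_agg = pd.pivot_table(movies_agg, values = 'movieId', index = 'movieId',
               columns = 'genres', aggfunc = len).reset_index()

In [21]:
# Prefix genre columns with 'genre_' and strip spaces, hyphens, and parentheses
movies_agg.columns = ['genre_' + x.replace(" ", "_").replace("-","_").replace("(","").replace(")","")
                      if x != 'movieId' else x for x in movies_agg.columns]

In [22]:
# Movies without a given genre get 0 instead of NaN
movies_agg = movies_agg.fillna(0)

In [23]:
# Cast the 0/1 indicators to booleans
for col in [x for x in movies_agg.columns if 'genre_' in x]:
    movies_agg[col] = movies_agg[col].astype(bool)

In [24]:
# Replace the original genres column with the indicator columns
movies = movies.drop(columns = 'genres').merge(movies_agg, on = 'movieId')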
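The split, explode, and pivot steps above turn the pipe-separated `genres` string into one boolean column per genre. As an alternative sketch (not the approach used in this notebook), `Series.str.get_dummies` can produce the same kind of indicator columns directly from the raw string column, that is, before it has been split into lists. The frame `movies_sample` is a hypothetical stand-in built from rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of movies.csv (genres still a string).
movies_sample = pd.DataFrame({
    "movieId": [1, 3, 5],
    "title": [
        "Toy Story (1995)",
        "Grumpier Old Men (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "Comedy",
    ],
})

# One 0/1 indicator column per genre found in the strings, cast to booleans.
flags = movies_sample["genres"].str.get_dummies(sep="|").astype(bool)

# Same naming scheme as the notebook: genre_ prefix, no spaces, hyphens, or parentheses.
flags.columns = [
    "genre_" + c.replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")
    for c in flags.columns
]

encoded = pd.concat([movies_sample.drop(columns="genres"), flags], axis=1)
print(encoded)
```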

In [25]:
movies

,movieId,title,genre_no_genres_listed,genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,genre_Documentary,...,genre_Film_Noir,genre_Horror,genre_IMAX,genre_Musical,genre_Mystery,genre_Romance,genre_Sci_Fi,genre_Thriller,genre_War,genre_Western
0,1,Toy Story (1995),False,False,True,True,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2,Jumanji (1995),False,False,True,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,3,Grumpier Old Men (1995),False,False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
3,4,Waiting to Exhale (1995),False,False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
4,5,Father of the Bride Part II (1995),False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),False,True,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
9738,193583,No Game No Life: Zero (2017),False,False,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
9739,193585,Flint (2017),False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9740,193587,Bungo Stray Dogs: Dead Apple (2018),False,True,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [27]:
movies.isna().sum()

movieId                   0
title                     0
genre_no_genres_listed    0
genre_Action              0
genre_Adventure           0
genre_Animation           0
genre_Children            0
genre_Comedy              0
genre_Crime               0
genre_Documentary         0
genre_Drama               0
genre_Fantasy             0
genre_Film_Noir           0
genre_Horror              0
genre_IMAX                0
genre_Musical             0
genre_Mystery             0
genre_Romance             0
genre_Sci_Fi              0
genre_Thriller            0
genre_War                 0
genre_Western             0
dtype: int64

In [28]:
movies.nunique()

movieId                   9742
title                     9737
genre_no_genres_listed       2
genre_Action                 2
genre_Adventure              2
genre_Animation              2
genre_Children               2
genre_Comedy                 2
genre_Crime                  2
genre_Documentary            2
genre_Drama                  2
genre_Fantasy                2
genre_Film_Noir              2
genre_Horror                 2
genre_IMAX                   2
genre_Musical                2
genre_Mystery                2
genre_Romance                2
genre_Sci_Fi                 2
genre_Thriller               2
genre_War                    2
genre_Western                2
dtype: int64

In [29]:
# Every rated movieId has a row in `movies`: the filter below returns no rows
ratings[~ratings['movieId'].isin(movies['movieId'])]

,userId,movieId,rating,timestamp
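The empty result confirms that every rated movie appears in `movies`. The reverse does not hold: `ratings` has 9724 distinct `movieId` values while `movies` has 9742 rows, which is consistent with the README's statement that movies with at least one rating or tag are included. A sketch of checking both directions at once with an outer merge and `indicator=True` follows. The sample frames are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and movies tables.
ratings_sample = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 3, 1]})
movies_sample = pd.DataFrame({"movieId": [1, 3, 4], "title": ["A", "B", "C"]})

# Outer merge on the distinct ids; _merge records which side each id came from.
ids = ratings_sample[["movieId"]].drop_duplicates().merge(
    movies_sample[["movieId"]], on="movieId", how="outer", indicator=True
)

# Ids present only in ratings would indicate a broken reference.
rated_but_missing = ids.loc[ids["_merge"] == "left_only", "movieId"].tolist()
# Ids present only in movies are movies that nobody rated.
never_rated = ids.loc[ids["_merge"] == "right_only", "movieId"].tolist()

print(rated_but_missing)
print(never_rated)
```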


# Links

In [30]:
links = pd.read_csv("data/links.csv")

In [31]:
links.head()

,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
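pandas reads `imdbId` as an integer, so the leading zeros seen in the README's example URL (`tt0114709`) are lost, and `tmdbId` is read as a float, which suggests that some rows may have no TMDB id. The sketch below rebuilds the URLs in the form given in the README. The column names `imdb_url` and `tmdb_url` and the missing value in the sample are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for links.csv; the missing tmdbId is made up.
links_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, None],
})

# Restore the zero padding: IMDb title ids have at least seven digits after "tt".
links_sample["imdb_url"] = [
    f"http://www.imdb.com/title/tt{int(i):07d}/" for i in links_sample["imdbId"]
]

# The nullable integer dtype keeps missing ids as <NA> instead of forcing floats.
links_sample["tmdbId"] = links_sample["tmdbId"].astype("Int64")
links_sample["tmdb_url"] = [
    None if pd.isna(i) else f"https://www.themoviedb.org/movie/{int(i)}"
    for i in links_sample["tmdbId"]
]

print(links_sample[["imdb_url", "tmdb_url"]])
```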


# Creating Users

In [32]:
# One row per distinct user; the dataset provides no user attributes beyond the id
users = ratings.drop_duplicates(subset = ['userId'])
users = users[['userId']]

In [33]:
users

,userId
0,1
232,2
261,3
300,4
516,5
...,...
97364,606
98479,607
98666,608
99497,609


# Building Data Set

In [34]:
ratings

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [35]:
movies

,movieId,title,genre_no_genres_listed,genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,genre_Documentary,...,genre_Film_Noir,genre_Horror,genre_IMAX,genre_Musical,genre_Mystery,genre_Romance,genre_Sci_Fi,genre_Thriller,genre_War,genre_Western
0,1,Toy Story (1995),False,False,True,True,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2,Jumanji (1995),False,False,True,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,3,Grumpier Old Men (1995),False,False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
3,4,Waiting to Exhale (1995),False,False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
4,5,Father of the Bride Part II (1995),False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),False,True,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
9738,193583,No Game No Life: Zero (2017),False,False,False,True,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
9739,193585,Flint (2017),False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9740,193587,Bungo Stray Dogs: Dead Apple (2018),False,True,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [36]:
data = ratings.merge(movies, on = 'movieId')

In [37]:
data = data.merge(links, on = 'movieId')
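Both merges are inner joins on `movieId`, so a rating would be dropped if its movie had no match, and rows would be duplicated if `movieId` were repeated in `movies` or `links`. A minimal sketch of guarding against both cases with the `validate` argument of `DataFrame.merge` and a row-count check is shown here. The sample frames are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_sample = pd.DataFrame(
    {"userId": [1, 2, 2], "movieId": [1, 1, 3], "rating": [4.0, 4.5, 3.0]}
)
movies_sample = pd.DataFrame(
    {"movieId": [1, 3], "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"]}
)
links_sample = pd.DataFrame({"movieId": [1, 3], "imdbId": [114709, 113228]})

# validate="many_to_one" raises pandas.errors.MergeError if movieId is
# duplicated on the right-hand side, which would otherwise multiply rows.
merged = ratings_sample.merge(movies_sample, on="movieId", validate="many_to_one")
merged = merged.merge(links_sample, on="movieId", validate="many_to_one")

# An inner join drops ratings without a matching movie, so compare row counts.
assert len(merged) == len(ratings_sample)
print(merged)
```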

In [38]:
data

,userId,movieId,rating,timestamp,title,genre_no_genres_listed,genre_Action,genre_Adventure,genre_Animation,genre_Children,...,genre_IMAX,genre_Musical,genre_Mystery,genre_Romance,genre_Sci_Fi,genre_Thriller,genre_War,genre_Western,imdbId,tmdbId
0,1,1,4.0,964982703,Toy Story (1995),False,False,True,True,True,...,False,False,False,False,False,False,False,False,114709,862.0
1,5,1,4.0,847434962,Toy Story (1995),False,False,True,True,True,...,False,False,False,False,False,False,False,False,114709,862.0
2,7,1,4.5,1106635946,Toy Story (1995),False,False,True,True,True,...,False,False,False,False,False,False,False,False,114709,862.0
3,15,1,2.5,1510577970,Toy Story (1995),False,False,True,True,True,...,False,False,False,False,False,False,False,False,114709,862.0
4,17,1,4.5,1305696483,Toy Story (1995),False,False,True,True,True,...,False,False,False,False,False,False,False,False,114709,862.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100831,610,160341,2.5,1479545749,Bloodmoon (1997),False,True,False,False,False,...,False,False,False,False,False,True,False,False,118745,30948.0
100832,610,160527,4.5,1479544998,Sympathy for the Underdog (1971),False,True,False,False,False,...,False,False,False,False,False,False,False,False,66806,90351.0
100833,610,160836,3.0,1493844794,Hazard (2005),False,True,False,False,False,...,False,False,False,False,False,True,False,False,798722,70193.0
100834,610,163937,3.5,1493848789,Blair Witch (2016),False,False,False,False,False,...,False,False,False,False,False,True,False,False,1540011,351211.0


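In [39]:
# Serialize with pandas so that NumPy types become valid JSON values
data_json = data.to_json(orient = 'records')

In [40]:
# Parse back into a list of dicts so it can be wrapped in an outer object
data_json = json.loads(data_json)

In [41]:
# Nest the records under a top-level "data" key
data_json = {"data": data_json}

In [42]:
data_json = json.dumps(data_json)

In [43]:
with open('data/imdb_movies.json', 'w') as f:
    f.write(data_json)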
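The file written above holds all records under a single top-level `"data"` key. The sketch below shows the same write path in condensed form, followed by reading the file back into a DataFrame. It uses a hypothetical two-row frame and a temporary path in place of `data/imdb_movies.json`.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the merged `data` frame.
data_sample = pd.DataFrame({
    "userId": [1, 5],
    "movieId": [1, 1],
    "rating": [4.0, 4.0],
    "title": ["Toy Story (1995)", "Toy Story (1995)"],
})

# to_json handles NumPy types; json.loads turns the result into plain Python objects.
payload = {"data": json.loads(data_sample.to_json(orient="records"))}

# A temporary path stands in for data/imdb_movies.json.
path = os.path.join(tempfile.mkdtemp(), "imdb_movies.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(payload, f)

# Reading back: unwrap the "data" key and rebuild the frame from the records.
with open(path, encoding="utf-8") as f:
    restored = pd.DataFrame(json.load(f)["data"])

print(restored)
```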