# Converting MovieLens datasets to TSV
This tutorial outlines a process for downloading and normalizing movie rating datasets from GroupLens Research. The tutorial demonstrates how to process and normalize the Tag Genome dataset and the MovieLens 100K and MovieLens 1M datasets, which are stored in different formats and require different approaches to normalization.

## What is MovieLens?
GroupLens Research is an organization that has made available a variety of datasets containing movie ratings and related information. These datasets include the MovieLens 25M dataset, which contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users, as well as the MovieLens Latest datasets, which include small and full versions with ratings and tag data for a smaller or larger number of movies, respectively. The organization also has synthetic datasets available, such as the MovieLens 1B Synthetic dataset, which is an expanded version of the ML-20M dataset, and older datasets like the MovieLens 100K and MovieLens 1M datasets. These datasets can be used for research, education, and development purposes, and users interested in using them should review the README files for usage licenses and other details before doing so.

## What is this pipeline?
This tutorial outlines a process for downloading and normalizing a variety of movie rating datasets from GroupLens Research. The datasets are stored in different formats, so each one needs its own normalization pipeline before analysis. The tutorial begins by installing the necessary Python packages and loading a list of dataset URLs from a JSON file. The datasets are then downloaded with the `BaseDownloader` class and stored in a local directory. Next, the tutorial processes the Tag Genome dataset, whose files describe movies, tags, and tag relevance scores: it extracts the year of release from the movie names, combines the movies and tags into a single node list, and turns the tag relevance scores into a weighted edge list. Finally, it applies the same approach to the other MovieLens datasets, such as MovieLens 1M and 10M, which are stored in different formats and require different parsing.

Here I collect the Python packages I will be using below.

In [1]:
!pip install tqdm pgeocode downloaders compress_json pandas numpy -Uq

In [2]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from downloaders import BaseDownloader
import compress_json
import pgeocode

We load the dataset URLs from a JSON file that I prepared:

In [3]:
datasets = compress_json.load("datasets.json")
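The contents of `datasets.json` are not shown in this notebook. The download cell only reads a `url` key from each entry, so a compatible file could look like the minimal sketch below. The entry names and URLs are hypothetical placeholders, and the standard `json` module stands in for `compress_json`.

```python
import json
import os
import tempfile

# Hypothetical entries: only the "url" key is read by the download cell.
# The URLs are placeholders, not the real GroupLens locations.
example_datasets = [
    {"name": "ml-1m", "url": "https://example.org/datasets/ml-1m.zip"},
    {"name": "tag-genome", "url": "https://example.org/datasets/tag-genome.zip"},
]

# Write the list to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), "datasets.json")
with open(path, "w") as f:
    json.dump(example_datasets, f, indent=2)

with open(path) as f:
    loaded = json.load(f)

# Same comprehension as the download cell uses.
urls = [dataset["url"] for dataset in loaded]
print(urls)
```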

In [4]:
BaseDownloader(
    process_number=1
).download([
    dataset["url"]
    for dataset in datasets
])


Unnamed: 0,status_code,file_size,downloaded_file_size,url,destination,success,cached,exception,extraction_file_size,extraction_destination,extraction_cached,extraction_success
0,200,5917549,5917549,https://files.grouplens.org/datasets/movielens...,downloads/ml-1m.zip,True,True,,96,downloads/ml-1m,True,True
1,200,43510670,43510670,https://files.grouplens.org/datasets/tag-genom...,downloads/tag-genome.zip,True,True,,96,downloads/tag-genome,True,True
2,200,198702078,198702078,https://files.grouplens.org/datasets/movielens...,downloads/ml-20m.zip,True,True,,96,downloads/ml-20m,True,True
3,200,65566137,65566137,https://files.grouplens.org/datasets/movielens...,downloads/ml-10m.zip,True,True,,96,downloads/ml-10m,True,True
4,200,4924029,4924029,https://files.grouplens.org/datasets/movielens...,downloads/ml-100k.zip,True,True,,96,downloads/ml-100k,True,True
5,200,3327436800,3327436800,https://files.grouplens.org/datasets/movielens...,downloads/ml-20mx16x32.tar,True,True,,96,downloads/ml-20mx16x32,True,True


The various datasets have very different formats, which of course means that for each and every dataset we will have to write a distinct normalization pipeline. Why don't people like a couple of TSVs with a node list and an edge list?

![Standards](https://imgs.xkcd.com/comics/standards.png)

Let's get started. A journey of a thousand miles begins with a single step and all that...

## Tag Genome Dataset 2014
The [Tag Genome Dataset](https://grouplens.org/datasets/movielens/tag-genome/) contains 11 million computed tag-movie relevance scores from a pool of 1,100 tags applied to 10,000 movies. Released 3/2014.

In [8]:
!tree downloads/tag-genome/tag-genome

downloads/tag-genome/tag-genome
├── README.htm
├── movies.dat
├── tag_relevance.dat
└── tags.dat

0 directories, 4 files


Let's see what each of these files looks like, [for which there is some documentation here](https://files.grouplens.org/datasets/tag-genome/README.html#file_desc).

In [48]:
node_types = pd.DataFrame({
    "node_type": ["Movie", "Tag"]
})

node_types.to_csv("movie_lens_tag_genome_2014_node_type_list.tsv.xz", sep="\t", index=False)

In [33]:
movies = pd.read_csv(
    "downloads/tag-genome/tag-genome/movies.dat",
    sep="\t",
    header=None,
    index_col=0
)

movies.columns = ["node_name", "popularity"]

# We get the year info for the movies
movies["year"] = movies\
    .node_name.str.rsplit("(", n=1, expand=True)[1]\
    .str.strip(") ").astype(np.int16)

# We cannot remove the year from the movie name, as
# the movie names would then contain duplicates.

# We assign a predefined clear node type
movies["node_type"] = 0

# We create a densified range, as the default IDs are not a dense range
movies["node_id"] = np.arange(movies.shape[0])

movies

0,node_name,popularity,year,node_type,node_id
1,Toy Story (1995),53059,1995,0,0
2,Jumanji (1995),22466,1995,0,1
3,Grumpier Old Men (1995),15111,1995,0,2
4,Waiting to Exhale (1995),2898,1995,0,3
5,Father of the Bride Part II (1995),14323,1995,0,4
...,...,...,...,...,...
106920,Her (2013),368,2013,0,9729
107069,Lone Survivor (2013),90,2013,0,9730
107141,Saving Mr. Banks (2013),153,2013,0,9731
107348,Anchorman 2: The Legend Continues (2013),83,2013,0,9732
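The `rsplit` approach above assumes that every title ends with a year in parentheses; the cast to `np.int16` would raise on a title without one. A more defensive variant, sketched here on made-up titles, uses a regular expression and a nullable integer dtype. This is an alternative, not what the notebook does.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Movie, The (Alternate Title) (1995)",  # hypothetical title with extra parentheses
    "Her (2013)",
    "Title Without A Year",  # hypothetical title with no year suffix
])

# Capture a four-digit year in parentheses at the very end of the title.
years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Nullable integers keep missing years as <NA> instead of raising.
years = pd.to_numeric(years, errors="coerce").astype("Int16")
print(years.tolist())
```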


In [32]:
tags = pd.read_csv(
    "downloads/tag-genome/tag-genome/tags.dat",
    sep="\t",
    header=None,
    index_col=0
)

tags.columns = ["node_name", "popularity"]

# We assign a predefined clear node type
tags["node_type"] = 1

tags

0,node_name,popularity,node_type
0,007,61,1
1,007 (series),24,1
2,18th century,37,1
3,1920s,42,1
4,1930s,55,1
...,...,...,...
1123,writing,49,1
1124,wuxia,17,1
1125,wwii,73,1
1126,zombie,81,1


We create the combined node list:

In [40]:
node_list = pd.concat([
    movies[["node_name", "popularity", "node_type", "year"]],
    tags
])
node_list

0,node_name,popularity,node_type,year
1,Toy Story (1995),53059,0,1995.0
2,Jumanji (1995),22466,0,1995.0
3,Grumpier Old Men (1995),15111,0,1995.0
4,Waiting to Exhale (1995),2898,0,1995.0
5,Father of the Bride Part II (1995),14323,0,1995.0
...,...,...,...,...
1123,writing,49,1,
1124,wuxia,17,1,
1125,wwii,73,1,
1126,zombie,81,1,


In [38]:
weighted_edges = pd.read_csv(
    "downloads/tag-genome/tag-genome/tag_relevance.dat",
    sep="\t",
    header=None,
)

weighted_edges.columns = ["source", "destination", "edge_weight"]

# Remapping sources to the densified range
weighted_edges["source"] = movies.node_id.loc[weighted_edges.source].values

# Shifting destinations, which are already a dense range, so that
# their node IDs start right after the last movie ID.
weighted_edges["destination"] += movies.shape[0]

weighted_edges

Unnamed: 0,source,destination,edge_weight
0,0,9734,0.032
1,0,9735,0.035
2,0,9736,0.070
3,0,9737,0.114
4,0,9738,0.105
...,...,...,...
10979947,9733,10857,0.327
10979948,9733,10858,0.030
10979949,9733,10859,0.006
10979950,9733,10860,0.161


And done! Now we have both a node list and an edge list, and we can save them to disk.

In [45]:
node_list.to_csv("movie_lens_tag_genome_2014_node_list.tsv.xz", sep="\t", index=False)


In [44]:
weighted_edges.to_csv("movie_lens_tag_genome_2014_edge_list.tsv.xz", sep="\t", index=False)


I have uploaded these files to the Internet Archive.
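Before sharing the files, it can be worth checking that they read back as expected. pandas infers the `xz` compression from the file extension on both write and read. The sketch below does a round trip on a toy edge list in a temporary directory; the file name is a placeholder.

```python
import os
import tempfile

import pandas as pd

edges = pd.DataFrame({
    "source": [0, 0, 1],
    "destination": [3, 4, 3],
    "edge_weight": [0.032, 0.035, 0.07],
})

# The .xz suffix is enough for pandas to pick the compression.
path = os.path.join(tempfile.mkdtemp(), "toy_edge_list.tsv.xz")
edges.to_csv(path, sep="\t", index=False)

restored = pd.read_csv(path, sep="\t")
print(restored)
```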

## MovieLens 1M Dataset
[MovieLens 1M movie ratings](https://grouplens.org/datasets/movielens/1m/). Stable benchmark dataset. 1 million ratings from 6000 users on 4000 movies. Released 2/2003.

In [5]:
!tree downloads/ml-1m/ml-1m/

downloads/ml-1m/ml-1m/
├── README
├── movies.dat
├── ratings.dat
└── users.dat

0 directories, 4 files


Let's see what each of these files looks like, [for which there is some documentation here](https://files.grouplens.org/datasets/movielens/ml-1m-README.txt).

In [11]:
movies = pd.read_csv(
    "downloads/ml-1m/ml-1m/movies.dat",
    sep="::",
    engine="python",
    encoding='ISO-8859-1',
    header=None,
    index_col=0
)

movies.columns = ["node_name", "node_type"]

# We get the year info for the movies
movies["year"] = movies\
    .node_name.str.rsplit("(", n=1, expand=True)[1]\
    .str.strip(") ").astype(np.int16)

# We need to add a characterizing node type
movies["node_type"] = [
    "|".join(["Movie"] + genres.split("|"))
    for genres in movies["node_type"]
]

# We create a densified range, as the default IDs are not a dense range
movies["node_id"] = np.arange(movies.shape[0])

movies

0,node_name,node_type,year,node_id
1,Toy Story (1995),Movie|Animation|Children's|Comedy,1995,0
2,Jumanji (1995),Movie|Adventure|Children's|Fantasy,1995,1
3,Grumpier Old Men (1995),Movie|Comedy|Romance,1995,2
4,Waiting to Exhale (1995),Movie|Comedy|Drama,1995,3
5,Father of the Bride Part II (1995),Movie|Comedy,1995,4
...,...,...,...,...
3948,Meet the Parents (2000),Movie|Comedy,2000,3878
3949,Requiem for a Dream (2000),Movie|Drama,2000,3879
3950,Tigerland (2000),Movie|Drama,2000,3880
3951,Two Family House (2000),Movie|Drama,2000,3881


In [15]:
users = pd.read_csv(
    "downloads/ml-1m/ml-1m/users.dat",
    sep="::",
    engine="python",
    encoding='ISO-8859-1',
    header=None,
    index_col=0
)

node_types = [
    "other",
    "academic/educator",
    "artist",
    "clerical/admin",
    "college/grad student",
    "customer service",
    "doctor/health care",
    "executive/managerial",
    "farmer",
    "homemaker",
    "K-12 student",
    "lawyer",
    "programmer",
    "retired",
    "sales/marketing",
    "scientist",
    "self-employed",
    "technician/engineer",
    "tradesman/craftsman",
    "unemployed",
    "writer"
]

users.columns = ["gender", "age", "node_type", "zip_code"]

nomi = pgeocode.Nominatim('us')

users["node_type"] = [
    "|".join(["User", "Male" if gender == "M" else "Female", node_types[node_type]])
    for gender, node_type in zip(
        users["gender"],
        users["node_type"]
    )
]

# We create a densified range, as the default IDs are not a dense range
users["node_id"] = np.arange(users.shape[0])
# Since we do not have a node name, we might as well use a number
users["node_name"] = np.arange(users.shape[0])

users = pd.concat(
    [
        users,
        pd.DataFrame(
            [
                nomi.query_postal_code(zip_code).to_dict()
                for zip_code in tqdm(users["zip_code"])
            ],
            index=users.index
        )
    ],
    axis=1
)

users


0,gender,age,node_type,zip_code,node_id,node_name,postal_code,country_code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
1,F,1,User|Female|K-12 student,48067,0,0,48067,US,Royal Oak,Michigan,MI,Oakland,125.0,,,42.4906,-83.1366,4.0
2,M,56,User|Male|self-employed,70072,1,1,70072,US,Marrero,Louisiana,LA,Jefferson Parish,51.0,,,29.8598,-90.1105,4.0
3,M,25,User|Male|scientist,55117,2,2,55117,US,Saint Paul,Minnesota,MN,Ramsey,123.0,,,44.9995,-93.0969,4.0
4,M,45,User|Male|executive/managerial,02460,3,3,02460,US,Newtonville,Massachusetts,MA,Middlesex,17.0,,,42.3520,-71.2084,4.0
5,M,25,User|Male|writer,55455,4,4,55455,US,Minneapolis,Minnesota,MN,Hennepin,53.0,,,44.9735,-93.2331,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,F,25,User|Female|scientist,32603,6035,6035,32603,US,Gainesville,Florida,FL,Alachua,1.0,,,29.6515,-82.3493,4.0
6037,F,45,User|Female|academic/educator,76006,6036,6036,76006,US,Arlington,Texas,TX,Tarrant,439.0,,,32.7785,-97.0834,4.0
6038,F,56,User|Female|academic/educator,14706,6037,6037,14706,US,Allegany,New York,NY,Cattaraugus,9.0,,,42.0918,-78.4999,4.0
6039,F,45,User|Female|other,01060,6038,6038,01060,US,Northampton,Massachusetts,MA,Hampshire,15.0,,,42.3223,-72.6313,4.0
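Many users share a zip code, so the lookup above repeats work. One possible shortcut is to resolve each distinct code once and then broadcast the result back to the users. In the sketch below, `lookup` is a hypothetical stand-in for a slow per-code call such as `nomi.query_postal_code`, and the returned fields are invented.

```python
import pandas as pd

calls = []

def lookup(zip_code):
    """Hypothetical stand-in for a slow per-code lookup."""
    calls.append(zip_code)
    state = "MN" if zip_code.startswith("55") else "other"
    return {"postal_code": zip_code, "state_code": state}

users = pd.DataFrame(
    {"zip_code": ["55117", "55455", "55117", "48067", "55117"]},
    index=[1, 2, 3, 4, 5],
)

# Resolve every distinct code once...
unique_codes = users["zip_code"].unique()
resolved = pd.DataFrame([lookup(code) for code in unique_codes], index=unique_codes)

# ...then broadcast the results back to one row per user.
geo = resolved.loc[users["zip_code"].to_numpy()]
geo.index = users.index
users = pd.concat([users, geo], axis=1)
print(users)
```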


In [16]:
node_list = pd.concat([
    users,
    movies
])
node_list

0,gender,age,node_type,zip_code,node_id,node_name,postal_code,country_code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy,year
1,F,1.0,User|Female|K-12 student,48067,0,0,48067,US,Royal Oak,Michigan,MI,Oakland,125.0,,,42.4906,-83.1366,4.0,
2,M,56.0,User|Male|self-employed,70072,1,1,70072,US,Marrero,Louisiana,LA,Jefferson Parish,51.0,,,29.8598,-90.1105,4.0,
3,M,25.0,User|Male|scientist,55117,2,2,55117,US,Saint Paul,Minnesota,MN,Ramsey,123.0,,,44.9995,-93.0969,4.0,
4,M,45.0,User|Male|executive/managerial,02460,3,3,02460,US,Newtonville,Massachusetts,MA,Middlesex,17.0,,,42.3520,-71.2084,4.0,
5,M,25.0,User|Male|writer,55455,4,4,55455,US,Minneapolis,Minnesota,MN,Hennepin,53.0,,,44.9735,-93.2331,4.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,,,Movie|Comedy,,3878,Meet the Parents (2000),,,,,,,,,,,,,2000.0
3949,,,Movie|Drama,,3879,Requiem for a Dream (2000),,,,,,,,,,,,,2000.0
3950,,,Movie|Drama,,3880,Tigerland (2000),,,,,,,,,,,,,2000.0
3951,,,Movie|Drama,,3881,Two Family House (2000),,,,,,,,,,,,,2000.0


In [19]:
rating = pd.read_csv(
    "downloads/ml-1m/ml-1m/ratings.dat",
    sep="::",
    engine="python",
    encoding='ISO-8859-1',
    header=None,
    #index_col=0
)

# UserID::MovieID::Rating::Timestamp
rating.columns = ["source", "destination", "rating", "timestamp"]

# Remapping sources to the densified range
rating["source"] = users.node_id.loc[rating.source].values

# Remapping destinations to the densified range
rating["destination"] = users.shape[0] + movies.node_id.loc[rating.destination].values

rating

Unnamed: 0,source,destination,rating,timestamp
0,0,7216,5,978300760
1,0,6695,3,978302109
2,0,6942,3,978301968
3,0,9379,4,978300275
4,0,8326,5,978824291
...,...,...,...,...
1000204,6039,7115,1,956716541
1000205,6039,7118,5,956704887
1000206,6039,6598,5,956704746
1000207,6039,7120,4,956715648


In [21]:
node_list.to_csv("movie_lens_ml_1m_node_list.tsv.xz", sep="\t", index=False)
rating.to_csv("movie_lens_ml_1m_edge_list.tsv.xz", sep="\t", index=False)

## MovieLens 10M Dataset
[MovieLens 10M movie ratings](https://grouplens.org/datasets/movielens/10m/). Stable benchmark dataset. 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. Released 1/2009.

In [22]:
!tree downloads/ml-10m/ml-10M100K/

downloads/ml-10m/ml-10M100K/
├── README.html
├── allbut.pl
├── movies.dat
├── ratings.dat
├── split_ratings.sh
└── tags.dat

0 directories, 6 files


In [29]:
movies = pd.read_csv(
    "downloads/ml-10m/ml-10M100K/movies.dat",
    sep="::",
    engine="python",
    encoding='ISO-8859-1',
    header=None,
    index_col=0
)

movies.columns = ["node_name", "node_type"]

# We get the year info for the movies
movies["year"] = movies\
    .node_name.str.rsplit("(", n=1, expand=True)[1]\
    .str.strip(") ").astype(np.int16)

# We need to add a characterizing node type
movies["node_type"] = [
    "|".join(["Movie"] + genres.split("|"))
    for genres in movies["node_type"]
]

# We create a densified range, as the default IDs are not a dense range
movies["node_id"] = np.arange(movies.shape[0])

movies

0,node_name,node_type,year,node_id
1,Toy Story (1995),Movie|Adventure|Animation|Children|Comedy|Fantasy,1995,0
2,Jumanji (1995),Movie|Adventure|Children|Fantasy,1995,1
3,Grumpier Old Men (1995),Movie|Comedy|Romance,1995,2
4,Waiting to Exhale (1995),Movie|Comedy|Drama|Romance,1995,3
5,Father of the Bride Part II (1995),Movie|Comedy,1995,4
...,...,...,...,...
65088,Bedtime Stories (2008),Movie|Adventure|Children|Comedy,2008,10676
65091,Manhattan Melodrama (1934),Movie|Crime|Drama|Romance,1934,10677
65126,Choke (2008),Movie|Comedy|Drama,2008,10678
65130,Revolutionary Road (2008),Movie|Drama|Romance,2008,10679


In [45]:
users = pd.DataFrame({
    "node_name": np.arange(71557),
    "node_type": "User"
})
users

Unnamed: 0,node_name,node_type
0,0,User
1,1,User
2,2,User
3,3,User
4,4,User
...,...,...
71552,71552,User
71553,71553,User
71554,71554,User
71555,71555,User
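The user count above is hard-coded. The ratings output later in this section lists user IDs as large as 71567, which suggests the count is worth deriving from the data rather than typing by hand. A minimal sketch on toy ratings, assuming 1-based user IDs that may contain gaps:

```python
import numpy as np
import pandas as pd

# Toy ratings with 1-based user IDs that contain gaps (no user 2 or 3).
rating = pd.DataFrame({"source": [1, 1, 4], "destination": [10, 11, 10]})

# One node per possible ID keeps the "subtract one" remapping valid.
n_users = int(rating["source"].max())
users = pd.DataFrame({"node_name": np.arange(n_users), "node_type": "User"})

rating["source"] -= 1

# Every remapped source must point at an existing user node.
assert rating["source"].max() < len(users)
print(len(users))
```

Users that never appear in the ratings become isolated nodes. Whether that is acceptable depends on what the graph is used for.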


In [47]:
# We offset the movies ids:

movies.node_id += users.shape[0]

In [49]:
node_list = pd.concat([
    users,
    movies[["node_name", "node_type", "year"]]
])

node_list

Unnamed: 0,node_name,node_type,year
0,0,User,
1,1,User,
2,2,User,
3,3,User,
4,4,User,
...,...,...,...
65088,Bedtime Stories (2008),Movie|Adventure|Children|Comedy,2008.0
65091,Manhattan Melodrama (1934),Movie|Crime|Drama|Romance,1934.0
65126,Choke (2008),Movie|Comedy|Drama,2008.0
65130,Revolutionary Road (2008),Movie|Drama|Romance,2008.0


In [50]:
%%time
# UserID::MovieID::Rating::Timestamp

rating = pd.read_csv(
    "downloads/ml-10m/ml-10M100K/ratings.dat",
    sep="::",
    engine="python",
    encoding='ISO-8859-1',
    header=None,
    dtype={
        0: np.uint32,
        1: np.uint16,
        2: np.uint8,
        3: np.uint32
    }
    #index_col=0
)

rating.columns = ["source", "destination", "rating", "timestamp"]

# Remapping sources to the densified range (user IDs start at 1)
rating["source"] -= 1

# Remapping destination to the densified range
rating["destination"] = movies.node_id.loc[rating.destination].values

rating

CPU times: user 38.7 s, sys: 1.25 s, total: 39.9 s
Wall time: 40.5 s


Unnamed: 0,source,destination,rating,timestamp
0,0,71677,5,838985046
1,0,71740,5,838983525
2,0,71785,5,838983392
3,0,71846,5,838983421
4,0,71870,5,838983392
...,...,...,...,...
10000049,71566,73580,1,912580553
10000050,71566,73599,2,912649143
10000051,71566,73767,5,912577968
10000052,71566,73811,2,912578016
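Most of the 40 seconds above likely goes to the Python parsing engine, which pandas needs for the two-character `::` separator. A workaround that the notebook does not use is to split on a single `:` with the default engine and keep only every other column. This only works for purely numeric files such as the ratings, since a free-text tag may itself contain a colon. A minimal sketch on toy lines in the same format:

```python
import io

import pandas as pd

# Toy lines in the UserID::MovieID::Rating::Timestamp layout.
raw = io.StringIO(
    "1::122::5::838985046\n"
    "1::185::5::838983525\n"
    "2::110::5::868245777\n"
)

# Splitting on ":" yields empty columns at odd positions, which usecols skips.
rating = pd.read_csv(raw, sep=":", header=None, usecols=[0, 2, 4, 6])
rating.columns = ["source", "destination", "rating", "timestamp"]
print(rating)
```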


In [51]:
%%time
# UserID::MovieID::Tag::Timestamp

tags = pd.read_csv(
    "downloads/ml-10m/ml-10M100K/tags.dat",
    sep="::",
    engine="python",
    encoding='ISO-8859-1',
    header=None,
    dtype={
        0: np.uint32,
        1: np.uint16,
        2: str,
        3: np.uint32
    }
    #index_col=0
)

tags.columns = ["source", "destination", "tag", "timestamp"]

# Remapping sources to the densified range
tags["source"] -= 1

# Remapping destination to the densified range
tags["destination"] = movies.node_id.loc[tags.destination].values

tags

CPU times: user 395 ms, sys: 17.5 ms, total: 412 ms
Wall time: 414 ms


Unnamed: 0,source,destination,tag,timestamp
0,14,76436,excellent!,1215184630
1,19,73240,politics,1188263867
2,19,73240,satire,1188263867
3,19,73897,chick flick 212,1188263835
4,19,73897,hanks,1188263835
...,...,...,...,...
95575,71555,72903,Gothic,1188263571
95576,71555,73897,chick flick,1188263606
95577,71555,74505,comedy,1188263626
95578,71555,74553,Gothic,1188263565


In [52]:
node_list.to_csv("movie_lens_ml_10m_node_list.tsv.xz", sep="\t", index=False)
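# Tags and ratings are distinct edge lists, so each gets its own file;
# writing both to the same path would overwrite the tags.
tags.to_csv("movie_lens_ml_10m_tag_edge_list.tsv.xz", sep="\t", index=False)
rating.to_csv("movie_lens_ml_10m_edge_list.tsv.xz", sep="\t", index=False)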
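Ratings and tag applications are two different kinds of user-to-movie edges. If a single edge list is preferred, one option is to add an explicit edge type column before concatenating. This is a sketch on toy rows; the `edge_type` column name is my own choice.

```python
import pandas as pd

rating = pd.DataFrame({
    "source": [0, 0],
    "destination": [5, 6],
    "rating": [5, 3],
    "timestamp": [838985046, 838983525],
})
tags = pd.DataFrame({
    "source": [1],
    "destination": [5],
    "tag": ["satire"],
    "timestamp": [1188263867],
})

# Label each table, then stack them; columns missing from one table become NaN.
edge_list = pd.concat(
    [rating.assign(edge_type="rating"), tags.assign(edge_type="tag")],
    ignore_index=True,
)
print(edge_list)
```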

## MovieLens 20M Dataset
[MovieLens 20M movie ratings](https://grouplens.org/datasets/movielens/20m/). Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

In [53]:
!tree downloads/ml-20m/ml-20m/

downloads/ml-20m/ml-20m/
├── README.txt
├── genome-scores.csv
├── genome-tags.csv
├── links.csv
├── movies.csv
├── ratings.csv
└── tags.csv

0 directories, 7 files


There exists documentation about the content of [these files here](https://files.grouplens.org/datasets/movielens/ml-20m-README.html).

In [92]:
# movieId,title,genres
movies = pd.read_csv("downloads/ml-20m/ml-20m/movies.csv", index_col=0)

movies.columns = ["node_name", "node_type"]

# Some years are missing, and this information is not
# hard to find, so I might as well add them myself.
missing_years = {
    'Babylon 5': 1994,
    'Brazil: In the Shadow of the Stadiums': 2014,
    'Slaying the Badger': 2014,
    'Tatort: Im Schmerz geboren': 2014,
    'National Theatre Live: Frankenstein': 2011,
    'The Court-Martial of Jackie Robinson': 1990,
    'In Our Garden': 2002,
    'Stephen Fry In America - New World': 2008,
    'Two: The Story of Roman & Nyro': 2013,
    "Li'l Quinquin": 2014,
    'A Year Along the Abandoned Road': 1991,
    'Body/Cialo': 2015,
    'Polskie gówno': 2015,
    'The Third Reich: The Rise & Fall': 2010,
    'My Own Man': 2014,
    'Moving Alan': 2003,
    'Michael Laudrup - en Fodboldspiller': 1993,
    "Millions Game, The (Das Millionenspiel)": 1970,
    "Bicycle, Spoon, Apple (Bicicleta, cullera, poma)": 2010
}

movies["node_name"] = [
    "{node_name} ({year})".format(
        node_name=node_name,
        year=missing_years[node_name]
    ) if node_name in missing_years else node_name
    for node_name in movies.node_name
]

# We get the year info for the movies
movies["year"] = movies\
    .node_name.str.rsplit("(", n=1, expand=True)[1]\
    .str.strip(") -").astype(np.uint16)

# We need to add a characterizing node type
movies["node_type"] = [
    "|".join(["Movie"] + genres.split("|"))
    for genres in movies["node_type"]
]

movies

movieId,node_name,node_type,year
1,Toy Story (1995),Movie|Adventure|Animation|Children|Comedy|Fantasy,1995
2,Jumanji (1995),Movie|Adventure|Children|Fantasy,1995
3,Grumpier Old Men (1995),Movie|Comedy|Romance,1995
4,Waiting to Exhale (1995),Movie|Comedy|Drama|Romance,1995
5,Father of the Bride Part II (1995),Movie|Comedy,1995
...,...,...,...
131254,Kein Bund für's Leben (2007),Movie|Comedy,2007
131256,"Feuer, Eis & Dosenbier (2002)",Movie|Comedy,2002
131258,The Pirates (2014),Movie|Adventure,2014
131260,Rentun Ruusu (2001),Movie|(no genres listed),2001


In [72]:
# tagId,tag
tag_names = pd.read_csv("downloads/ml-20m/ml-20m/genome-tags.csv")

tag_names

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
...,...,...
1123,1124,writing
1124,1125,wuxia
1125,1126,wwii
1126,1127,zombie


This seems to be a rather weird conjoined edge list.

In [68]:
# movieId,imdbId,tmdbId
links = pd.read_csv("downloads/ml-20m/ml-20m/links.csv")

links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
27273,131254,466713,4436.0
27274,131256,277703,9274.0
27275,131258,3485166,285213.0
27276,131260,249110,32099.0


In [69]:
# movieId,tagId,relevance
scores = pd.read_csv("downloads/ml-20m/ml-20m/genome-scores.csv")

# Add edge type movie to tag

scores

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02500
1,1,2,0.02500
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675
...,...,...,...
11709763,131170,1124,0.58775
11709764,131170,1125,0.01075
11709765,131170,1126,0.01575
11709766,131170,1127,0.11450


In [71]:
# userId,movieId,rating,timestamp
ratings = pd.read_csv(
    "downloads/ml-20m/ml-20m/ratings.csv",
    dtype=dict(
        userId=np.uint32,
        movieId=np.uint32,
        rating=np.float16,
        timestamp=np.uint32
    )
)

# Add edge type user to movie

ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
...,...,...,...,...
20000258,138493,68954,4.5,1258126920
20000259,138493,69526,4.5,1259865108
20000260,138493,69644,3.0,1260209457
20000261,138493,70286,5.0,1258126944


In [70]:
# userId,movieId,tag,timestamp
tags = pd.read_csv(
    "downloads/ml-20m/ml-20m/tags.csv",
    dtype=dict(
        userId=np.uint32,
        movieId=np.uint32,
        tag=str,
        timestamp=np.uint32
    )
)

# Add edge type user to movie

tags

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,1240597180
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078
...,...,...,...,...
465559,138446,55999,dragged,1358983772
465560,138446,55999,Jason Bateman,1358983778
465561,138446,55999,quirky,1358983778
465562,138446,55999,sad,1358983772
