### Goal of the notebook

* The small version of the dataset has around 100k ratings on 9k movies from 700 users.
* The full version of the dataset has around 26M ratings on 45k movies from 270k users.

* In this notebook, I try to build a combination of the two graphs by:
    * keeping all the movies of the small graph
    * keeping all the ratings of the large graph on these movies
    * keeping only the users who made the ratings that were kept
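
The three steps above can be sketched on toy data before running them on the real files. This is only a minimal illustration: the `toy_small` and `toy_large` records and the `combine` helper are hypothetical names that do not appear in the notebook, and the rating values are made up. The only assumption carried over is that each rating is a dict with `userId` and `movieId` keys.

```python
# Toy stand-ins for the real rating records (made-up values).
toy_small = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": 2, "movieId": 20, "rating": 3.5},
]
toy_large = [
    {"userId": 100, "movieId": 10, "rating": 5.0},
    {"userId": 101, "movieId": 20, "rating": 2.0},
    {"userId": 102, "movieId": 30, "rating": 4.5},  # movie absent from toy_small
    {"userId": 100, "movieId": 30, "rating": 1.0},  # movie absent from toy_small
]


def combine(small, large):
    """Hypothetical helper: apply the three steps from the list above."""
    # Step 1: keep all the movies of the small set.
    movie_ids = {r["movieId"] for r in small}
    # Step 2: keep all the ratings of the large set on these movies.
    kept = [r for r in large if r["movieId"] in movie_ids]
    # Step 3: keep only the users who made the kept ratings.
    user_ids = {r["userId"] for r in kept}
    return movie_ids, kept, user_ids


movie_ids, kept_ratings, user_ids = combine(toy_small, toy_large)
print(len(kept_ratings), sorted(user_ids))
```

On this toy input, user 102 drops out because their only rating is on a movie that is not in the small set.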

### Imports

In [1]:
import pathlib
import os
import sys
from collections import defaultdict
from statistics import mean
from py2neo import Graph
from py2neo.bulk import merge_nodes, merge_relationships
import random

parent_path = pathlib.Path(os.getcwd()).parent.absolute()
sys.path.append(str(parent_path))

from utils.general import read_csv, df_to_json
from tqdm import tqdm
from tabulate import tabulate


### Load CSVs

In [2]:
data_dir = "movies_with_metadata"

In [5]:
ratings_small = df_to_json(
    read_csv(
        filename="ratings_small",
        parent_dir_name=data_dir,
    )
)


Reading from: /Users/ioannisathanasiou/diploma/model/movies_with_metadata/ratings_small.csv


In [6]:
ratings_large = df_to_json(
    read_csv(
        filename="ratings",
        parent_dir_name=data_dir,
    )
)

Reading from: /Users/ioannisathanasiou/diploma/model/movies_with_metadata/ratings.csv


### Get the corresponding movies and ratings

In [21]:
# Movie IDs that appear in each ratings file
movies_small_ids = {rating["movieId"] for rating in ratings_small}
movies_large_ids = {rating["movieId"] for rating in ratings_large}
# Number of movies rated in the large file but absent from the small one
len(movies_large_ids - movies_small_ids)

36057

In [23]:
# Keep only the large-file ratings whose movie also appears in the small file
ratings_mixed = [rating for rating in ratings_large if rating["movieId"] in movies_small_ids]
print(len(ratings_large), "=>", len(ratings_mixed))

26024289 => 25107579


### Conclusion

* Only about 0.9M of the roughly 26M ratings were removed
* This is not enough of a reduction of the dataset
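
The size of the reduction can be checked directly from the two counts printed by the filtering cell. The snippet below is a small sketch that hard-codes those counts. The variable names are illustrative and are not part of the notebook.

```python
# Counts copied from the output of the filtering cell above.
total_ratings = 26_024_289  # ratings in the large file
kept_ratings_count = 25_107_579  # ratings left after filtering by movie

removed = total_ratings - kept_ratings_count
removed_share = removed / total_ratings

print(f"removed {removed:,} ratings ({removed_share:.1%} of the total)")
```

So the filter drops fewer than one rating in twenty-five, which is why the combined graph is almost as large as the full one.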