# Final processing of the MovieLens dataset
* Uses MovieLens data available [here](https://grouplens.org/datasets/movielens/10m/) at the time of writing.
* Generate a file mapping each new uid to the original uid: uid_to_uid.csv
* Generate a file mapping each new iid to the original movie-genre pair: iid_to_movie_genre.csv
* Generate 3 folders: train, validation, test
* Each of the folders should in the end have X.npy, y.npy, seq_lens.npy, user_ids.npy (the last one is not explicitly needed for training but may be useful for debugging)
* The order of things is as follows:
    * Assign unique ids to users and items (keep track of original values - make one column for movie+genre)
    * Convert time to unix epochs (MovieLens timestamps are already stored as seconds since the unix epoch)
    * Remove users with 2 or fewer interactions
    * Sort each user's interactions by time so that the earliest interaction comes first; break ties non-randomly
    * Add delta_t (the time elapsed since the user's previous interaction); the first interaction of each user has no predecessor and is removed
    * Re-index the remaining items sequentially and record the mapping in iid_to_movie_genre.csv
        * Do that only after removing items that are not in train. This is necessary because a high percentage of items is removed from validation and test: if they stayed in the index they would be random noise and could waste a noticeable amount of memory in the embedding matrix. All unique iid-movie-genre combinations are therefore taken from the original DataFrame and joined with train_df, which supplies the unique items actually used in the experiments. A new factorised column is created, and the mapping from these new indices to the relevant movie-genre pairs is saved. The new table is then joined with train_df, val_df and test_df on iid, which supplies the new index to these DataFrames.
    * Split into train-validation-test with overhangs of one item (for label)
        * Apply same logic as in dataset.py
    * for each subset:
        * Remove items from validation and test if they are not present in train
        * Split into X,y
        * Place into numpy arrays 20 interactions at a time, apply padding if needed
        * Obtain seq_lens and user_ids
        * Save
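
The chunking and padding step from the list above can be sketched as follows. This is only an illustration under assumptions: the helper name `chunk_and_pad`, the padding value of 0 and the single-feature input are hypothetical, and the real logic lives in `generate_big_hop_numpy_files`, which is not shown in this notebook.

```python
import numpy as np

def chunk_and_pad(seq, max_len=20, pad_value=0):
    """Split a 1-D sequence into rows of max_len and pad the last row.

    Hypothetical helper for illustration; not the repo's implementation.
    """
    seq = np.asarray(seq)
    n_chunks = int(np.ceil(len(seq) / max_len))
    X = np.full((n_chunks, max_len), pad_value, dtype=seq.dtype)
    seq_lens = np.zeros(n_chunks, dtype=np.int64)
    for i in range(n_chunks):
        chunk = seq[i * max_len:(i + 1) * max_len]
        X[i, :len(chunk)] = chunk      # real interactions first
        seq_lens[i] = len(chunk)       # number of non-padded positions
    return X, seq_lens

# A user with 45 interactions yields two full rows and one padded row
X_demo, seq_lens_demo = chunk_and_pad(np.arange(1, 46))
print(X_demo.shape, seq_lens_demo)
```

Keeping `seq_lens` next to the padded array lets a model mask out the padded positions.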

## Imports 

In [None]:
import pandas as pd
import numpy as np
import time
import datetime
from pathlib import Path
from IPython.display import display, HTML
from itertools import compress
import sys
import os
from importlib import reload

## Settings

In [None]:
project_root = Path("/Users/nknyazev/Documents/Delft/Thesis/temporal") # Specify your own project root
data_root = project_root.joinpath("data")
code_root = project_root.joinpath("code")
input_path = data_root.joinpath("original/ml-10m/ratings.dat")
iid_movie_genre_path = data_root.joinpath("original/ml-10m/movies.dat")
output_dir = data_root.joinpath("processed/final/ml-10m/")
input_columns = ["uid", "iid", "rating", "t"]

## Additional imports from own modules 

In [None]:
sys.path.append(str(code_root))
import model.utils.datasplit
reload(model.utils.datasplit)
from model.utils.datasplit import train_val_test_split_train_overlapping, remove_unseen_items_in_train, generate_big_hop_numpy_files

## Load dataset as pandas DataFrame

In [None]:
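# Multi-character separators are only supported by the python engine; ratings use half-star increments, so they are read as floats
df = pd.read_csv(input_path, names=input_columns, sep="::", header=None, engine="python", dtype={"uid": np.int32, "iid": np.int32, "rating": np.float32, "t": np.int32})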

In [None]:
df.shape

### Preview of the data

In [None]:
display(df.head())

In [None]:
print("Number of unique users: {}".format(df["uid"].nunique()))
print("Number of unique items: {}".format(df["iid"].nunique()))

## Drop rating

In [None]:
df = df.drop(columns="rating")

## All interactions grouped by user id


In [None]:
grouped = df.groupby("uid")

## Find users with fewer than 3 interactions

In [None]:
# Each element is the index of the rows belonging to one user with fewer than 3 interactions
cols_for_users_under_3 = [grouped.groups[k] for k in grouped.groups.keys() if len(grouped.groups[k]) < 3]
# Rows in DataFrame to remove
flattened = [idx for user in cols_for_users_under_3 for idx in user]
print("Found {} short users summing to {} interactions".format(len(cols_for_users_under_3), len(flattened)))
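
The same count can be obtained without iterating over `grouped.groups`. The sketch below uses a toy DataFrame (the data and the names `toy`, `counts` and `short_rows` are made up for illustration) and shows one possible way to locate and drop short users; in this notebook the actual removal happens later, in the delta_t loop.

```python
import pandas as pd

toy = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "iid": [10, 11, 12, 10, 13, 11, 12, 13, 14],
})
# Number of interactions per user, aligned with the rows of toy
counts = toy.groupby("uid")["iid"].transform("count")
# Row labels that belong to users with fewer than 3 interactions
short_rows = toy.index[counts < 3]
print("Rows belonging to short users:", list(short_rows))
toy_filtered = toy.drop(short_rows)
```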

## Sort each user's interactions by time (oldest first), breaking ties non-randomly

### Reorder each user's interactions
Interactions are ordered by time (ascending). Many interactions were logged at exactly the same time, so the original DataFrame index is used as a second sort key to break ties: the interaction that appeared first in the file is also placed first.

In [None]:
df = df.reset_index().sort_values(by=["uid", "t", "index"]).drop(columns="index").reset_index(drop=True)
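
The effect of the tie-breaking key can be seen on a small example. The toy data below is made up for illustration; the sort expression is the same one used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "uid": [2, 1, 1, 1],
    "iid": [7, 5, 6, 4],
    "t":   [100, 300, 200, 200],  # user 1 has two interactions at t=200
})
toy_sorted = (
    toy.reset_index()                              # keep file order as "index"
       .sort_values(by=["uid", "t", "index"])      # file order breaks ties in t
       .drop(columns="index")
       .reset_index(drop=True)
)
print(toy_sorted)
```

For user 1 the two interactions at `t=200` keep their file order (item 6 before item 4), and the interaction at `t=300` comes last.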

### Verify the time is now sorted

In [None]:
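display(df.head())
# Within each user, consecutive time differences must be non-negative
assert (df.groupby("uid")["t"].diff().dropna() >= 0).all()
print("Time is sorted in correct order")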

## Calculate delta_t's 

### List to keep track of the index of each user's first interaction - these rows will be removed


In [None]:
remove_interaction_indices = []

### List to keep track of time deltas

In [None]:
time_deltas = []

### Process each user - every interaction of a user with more than 2 interactions gets a time delta, and the user's first interaction gets NaN

In [None]:
grouped = df.groupby("uid")
for uid, interactions in grouped:
    if len(interactions) > 2:
        remove_interaction_indices.append(interactions.index[0])
        time_delta_with_na = interactions["t"] - interactions.shift(1)["t"]
        time_deltas.extend(time_delta_with_na)
    else:
        remove_interaction_indices.extend(interactions.index)
        print("Removed interactions directly.")
    if uid % 50 == 1:
        print("Completed user {}.".format(uid))

### Remove NaNs

In [None]:
time_deltas_wo_na = list(compress(time_deltas, ~np.isnan(time_deltas)))

In [None]:
len(time_deltas_wo_na)

### Sanity check: length of the original df minus the number of removed rows equals the number of remaining time deltas

In [None]:
assert len(df) - len(remove_interaction_indices) == len(time_deltas_wo_na)

### Remove interactions without time deltas 

In [None]:
df = df.drop(remove_interaction_indices)

### Add time deltas to output

In [None]:
df["dt"] = np.array(time_deltas_wo_na, dtype=np.int32)

In [None]:
display(df[:20])

In [None]:
np.sum(df["dt"] == 0)
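
As a cross-check, the time deltas can also be computed without an explicit loop. The sketch below is an alternative under assumptions, not the method used above: it expects rows already sorted by user and time, uses made-up toy data, and does not handle the removal of short users.

```python
import pandas as pd

toy = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2, 2, 2],
    "t":   [10, 10, 25, 5, 8, 8, 20],
})
# diff() within each user gives NaN for that user's first interaction
toy["dt"] = toy.groupby("uid")["t"].diff()
# Drop the first interaction of every user and store deltas as integers
toy = toy.dropna(subset=["dt"]).astype({"dt": "int32"})
print(toy)
```

A delta of 0 marks interactions that were logged at exactly the same time as the previous one.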

## Split data into three DataFrames: train, validation, test - 0.9, 0.05, 0.05 of each user's sequence respectively
The split uses `train_val_test_split_train_overlapping` from `model.utils.datasplit` in this repo.

In [None]:
train_df, val_df, test_df = train_val_test_split_train_overlapping(df=df[["uid", "iid", "dt"]], 
                                                                   col_names=["uid", "iid", "dt"],
                                                                  split=[0.9, 0.05, 0.05])

In [None]:
print("Original DataFrame Length - {}\nResulting DataFrame lengths:\nTrain - {}\nValidation - {}\nTest - {}\nTotal lengths - {}".format(len(df), len(train_df), len(val_df), len(test_df), len(train_df)+len(val_df)+len(test_df)))
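
The implementation of `train_val_test_split_train_overlapping` is not shown here. The sketch below illustrates one possible reading of a per-user split with a one-item overhang, where validation and test each start one item early so that their first label has an input. The function `split_user` and its rounding behaviour are hypothetical and may differ from the repo's code.

```python
def split_user(seq, split=(0.9, 0.05, 0.05)):
    """Hypothetical per-user split with a one-item overhang.

    Not the repo's implementation; boundaries are rounded to whole items.
    """
    n = len(seq)
    train_end = round(n * split[0])
    val_end = round(n * (split[0] + split[1]))
    train = seq[:train_end]
    # Later subsets start one item early; that item serves only as input
    val = seq[max(train_end - 1, 0):val_end]
    test = seq[max(val_end - 1, 0):]
    return train, val, test

items = list(range(100))  # one user's interactions, already sorted by time
train, val, test = split_user(items)
print(len(train), len(val), len(test))
```

With an overhang the subset lengths sum to more than the original length, which is why the total printed above can exceed the length of the original DataFrame.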

## For validation/test, remove interactions with items not present in train

In [None]:
og_val_items = set(val_df["iid"])
og_ts_items = set(test_df["iid"])
og_val_ts_items = og_val_items.union(og_ts_items)

In [None]:
val_df = remove_unseen_items_in_train(train_df=train_df, test_df=val_df)
test_df = remove_unseen_items_in_train(train_df=train_df, test_df=test_df)

In [None]:
val_items = set(val_df["iid"])
ts_items = set(test_df["iid"])
val_ts_items = val_items.union(ts_items)
items_removed = len(og_val_ts_items)-len(val_ts_items)
items_in_original_df = df["iid"].nunique()
print("Removed {} unique items from validation and test, which is a fraction of {} of the original dataset's items.".format(items_removed, round(items_removed/items_in_original_df, 2)))

In [None]:
print("Original DataFrame Length - {}\nResulting DataFrame lengths:\nTrain - {}\nValidation - {}\nTest - {}\nTotal lengths - {}".format(len(df), len(train_df), len(val_df), len(test_df), len(train_df)+len(val_df)+len(test_df)))
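
The filtering performed by `remove_unseen_items_in_train` is not shown in this notebook. A minimal sketch of the same idea, assuming it simply keeps rows whose item occurs in train, could look like this (the toy data and variable names are made up):

```python
import pandas as pd

toy_train = pd.DataFrame({"uid": [1, 1, 2], "iid": [10, 11, 12]})
toy_val = pd.DataFrame({"uid": [1, 2, 2], "iid": [11, 13, 10]})

# Items the model can learn an embedding for
seen = toy_train["iid"].unique()
# Keep only validation rows whose item was seen in train
toy_val_filtered = toy_val[toy_val["iid"].isin(seen)]
print(toy_val_filtered)
```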

## Link item indices in train_df to the external movie file

### External file

In [None]:
df2 = pd.read_csv(iid_movie_genre_path, header=None, names=["iid", "movie_name", "genre"], sep="::", engine="python")

In [None]:
display(df2.head())

### DataFrame with unique items as index; uid and dt as values

In [None]:
unique_iid_train_df = train_df.groupby("iid").first()

### Merge the two above on the iid index

In [None]:
unique_iid_movie_genre = unique_iid_train_df.join(df2.set_index("iid"))[["movie_name", "genre"]]

### Create new column with factorized iid

In [None]:
unique_iid_movie_genre["new_iid"] = pd.factorize(unique_iid_movie_genre.index)[0]

In [None]:
unique_iid_movie_genre.tail()
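
To illustrate what `pd.factorize` does here, the toy example below maps a handful of sparse original item ids to dense indices starting at 0. The ids are made up; only the `pd.factorize` call mirrors the cell above.

```python
import pandas as pd

original_iids = pd.Index([3, 17, 42, 1193], name="iid")
# codes are dense integers in order of first appearance; uniques are the originals
codes, uniques = pd.factorize(original_iids)
mapping = pd.Series(codes, index=uniques)

# Apply the mapping to a column of original ids
remapped = pd.Series([42, 3, 42]).map(mapping)
print(mapping)
print(remapped.tolist())
```

Dense indices keep the embedding matrix exactly as large as the number of items seen in train.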

### Save data to a separate file explaining which movie-genre pair each item id stands for

#### Specify output path

In [None]:
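output_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists before writing
unique_iid_movie_genre_path = output_dir.joinpath("iid_to_movie_genre.csv")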

#### Save

In [None]:
unique_iid_movie_genre.to_csv(unique_iid_movie_genre_path, columns=["new_iid", "movie_name", "genre"], header=False, index=False, sep="\\")

### Join each of the train/validation/test DataFrames on iid

In [None]:
len(train_df)

In [None]:
train_df = train_df.join(unique_iid_movie_genre[["new_iid"]], on="iid").drop("iid", axis=1).rename(columns={"new_iid": "iid"})

In [None]:
len(train_df)

In [None]:
val_df = val_df.join(unique_iid_movie_genre[["new_iid"]], on="iid").drop("iid", axis=1).rename(columns={"new_iid": "iid"})

#### Verify before-after

In [None]:
test_df.tail()

In [None]:
test_df = test_df.join(unique_iid_movie_genre[["new_iid"]], on="iid").drop("iid", axis=1).rename(columns={"new_iid": "iid"})

In [None]:
test_df.tail()
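
A join of this kind yields NaN for any item id missing from the lookup table, which would also turn the new `iid` column into floats. The sketch below repeats the join on made-up toy data and adds a check for that case; the check and the final integer cast are suggestions, not part of the original pipeline.

```python
import pandas as pd

lookup = pd.DataFrame({"new_iid": [0, 1, 2]},
                      index=pd.Index([3, 17, 42], name="iid"))
toy_val = pd.DataFrame({"uid": [1, 1, 2], "iid": [42, 3, 17], "dt": [5, 0, 9]})

remapped = (
    toy_val.join(lookup[["new_iid"]], on="iid")   # look up the new index by old iid
           .drop("iid", axis=1)
           .rename(columns={"new_iid": "iid"})
)
# Every item must have been found in the lookup table
assert remapped["iid"].notna().all()
remapped["iid"] = remapped["iid"].astype("int32")
print(remapped)
```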

## Create X, y, seq_lens and user_ids out of these three DataFrames

In [None]:
train_array = []
val_array = []
test_array = []
dfs = [train_df, val_df, test_df]
arrays = [train_array, val_array, test_array]

### Iterate over each of the three DataFrames and create 4 numpy arrays that are added to a list

In [None]:
for index in range(len(arrays)):
    dataframe = dfs[index]
    X, y, seq_lens = generate_big_hop_numpy_files(dataframe, features=["uid", "iid", "dt"], save=False)
    # uid is the first feature, so X[:, 0, 0] holds the user id of each sequence
    arrays[index].extend([X, y, seq_lens, X[:, 0, 0]])

    

## Save these into output_dir/{subset} as .npy files

In [None]:
subset_names = ["train", "validation", "test"]
file_types = ["X", "y", "seq_lens", "user_ids"]
file_names = [x + ".npy" for x in file_types]
# Distinct loop names avoid shadowing the X and y arrays created above
for subset_name, subset_arrays in zip(subset_names, arrays):
    target_folder = output_dir.joinpath(subset_name)
    # parents=True also creates output_dir itself if it is missing
    target_folder.mkdir(parents=True, exist_ok=True)
    for file_name, array in zip(file_names, subset_arrays):
        file_path = str(target_folder.joinpath(file_name))
        print("Writing {}".format(file_path))
        np.save(file_path, array)
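
To confirm that the saved files can be read back, a round trip with `np.load` is a cheap check. The sketch below writes to a temporary directory rather than `output_dir`, and the array shape of 2 sequences, 20 steps and 3 features is an assumption chosen to match the description above.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    subset_dir = Path(tmp) / "train"
    subset_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for X: 2 sequences, 20 steps, 3 features (uid, iid, dt)
    X_demo = np.arange(2 * 20 * 3, dtype=np.int32).reshape(2, 20, 3)
    np.save(str(subset_dir / "X.npy"), X_demo)
    X_loaded = np.load(str(subset_dir / "X.npy"))
    round_trip_ok = np.array_equal(X_demo, X_loaded)

print("Round trip ok:", round_trip_ok)
```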
        