## Preprocessing

Here is what we are going to do:
   1) Import packages.
   2) Read the MovieLens-1M dataset.
   3) Re-encode userId and movieId as contiguous zero-based indices, and drop the timestamp column.
   4) Convert the explicit ratings into implicit feedback. We keep both positive and negative feedback, using a threshold of 3: ratings of 3 or higher become 1, and ratings below 3 become -1.
   5) Split the dataset into train and validation sets.
   6) Convert the train dataset (triplets of userId, movieId, and rating) into an interaction matrix, because some of the methods train on an interaction matrix.
   7) Add another column to the triplet dataset, which some of the methods also need: a movieId that the user with the given userId has no interaction with. We will call it negativeId.

So in the end, we have 3 datasets:
   * The interaction matrix R, saved as a sparse matrix.
   * The quadruplet dataset, a pandas DataFrame saved as a CSV file.
   * The validation dataset, also a pandas DataFrame saved as a CSV file.

You can download the various versions of the MovieLens dataset here: https://grouplens.org/datasets/movielens/

### Import Packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_array,load_npz
from scipy import sparse
from torch_geometric.utils import negative_sampling
import torch
from tqdm import tqdm

### Read Dataset

In [2]:
# Build the path with os.path.join so it works on any operating system.
DATA_PATH = os.path.join(os.getcwd(), "ml-1m", "ratings.dat")

In [3]:
# A multi-character separator such as '::' requires the Python parsing engine.
df = pd.read_csv(DATA_PATH,sep='::',names=['userId','movieId','rating','timestamp'],engine='python')


In [4]:
df

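| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| ... | ... | ... | ... | ... |
| 1000204 | 6040 | 1091 | 1 | 956716541 |
| 1000205 | 6040 | 1094 | 5 | 956704887 |
| 1000206 | 6040 | 562 | 5 | 956704746 |
| 1000207 | 6040 | 1096 | 4 | 956715648 |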


### Fixing userId and movieId

In [5]:
user_label_encoder = LabelEncoder()
movie_label_encoder = LabelEncoder()

In [6]:
df.userId = user_label_encoder.fit_transform(df.userId.values)
df.movieId = movie_label_encoder.fit_transform(df.movieId.values)
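The re-encoding matters because the raw ids need not be contiguous, while the interaction matrix built later is indexed from 0 to n-1. As a minimal sketch of the mapping, the snippet below reproduces it with `np.unique`, which yields the same sorted, zero-based codes that `LabelEncoder` produces. The ids in `raw_ids` are made up for illustration.

```python
import numpy as np

# Toy movieIds with gaps and repeats (illustrative values only).
raw_ids = np.array([1193, 661, 1193, 3408, 661])

# classes: sorted unique ids; encoded: zero-based position of each id in classes.
classes, encoded = np.unique(raw_ids, return_inverse=True)

# Indexing classes with the codes recovers the original ids.
decoded = classes[encoded]

print(classes.tolist())   # sorted unique raw ids
print(encoded.tolist())   # contiguous codes
```

Keeping the fitted encoders around makes it possible to map predictions back to the original MovieLens ids later.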

In [7]:
n_users = np.unique(df.userId.values).shape[0]
n_movies = np.unique(df.movieId.values).shape[0] 

In [8]:
df.drop(columns=['timestamp'],inplace=True)

### Convert ratings into implicit feedback

In [9]:
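# Ratings of 3 or higher become positive feedback (1); lower ratings become negative feedback (-1).
# A single column assignment avoids chained indexing, which may fail to modify df.
df['rating'] = np.where(df['rating'] >= 3, 1, -1)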
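To make the threshold concrete, here is a small sketch on made-up ratings covering the whole 1 to 5 scale. The frame `toy` and the constant `THRESHOLD` are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# One toy rating per value on the 1-5 scale (not taken from MovieLens).
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

THRESHOLD = 3  # same threshold as in the text

# Ratings at or above the threshold map to 1, everything else to -1.
toy['feedback'] = np.where(toy['rating'] >= THRESHOLD, 1, -1)

print(toy)
```

Note that a rating of exactly 3 counts as positive feedback under this rule.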

### Train and validation split

In [10]:
df_train,df_val = train_test_split(df,test_size=0.2,stratify=df.rating.values,random_state=42)

In [11]:
df_val.to_csv('df_val.csv',index=False)
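Passing `stratify` keeps the share of positive and negative feedback the same in both splits. The sketch below checks this on synthetic labels (80 positive, 20 negative), which are an assumption made for illustration and not the MovieLens distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic implicit labels: 80 positive and 20 negative interactions.
labels = np.array([1] * 80 + [-1] * 20)
rows = np.arange(labels.shape[0])

# Same call pattern as above: 20% validation, stratified on the label.
train_idx, val_idx = train_test_split(
    rows, test_size=0.2, stratify=labels, random_state=42
)

# Fraction of positive feedback in each split.
train_pos = (labels[train_idx] == 1).mean()
val_pos = (labels[val_idx] == 1).mean()

print(train_pos, val_pos)
```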

### Constructing interaction matrix

In [12]:
# Select the columns by name rather than by position.
rows = df_train['userId'].to_numpy()
cols = df_train['movieId'].to_numpy()
values = df_train['rating'].to_numpy()

In [13]:
R_train = csr_array((values,(rows,cols)),shape=(n_users,n_movies),dtype=np.int8)

In [14]:
sparse.save_npz('R_train.npz',R_train)
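As a small illustration of what the interaction matrix contains, the sketch below builds one from a handful of made-up triplets with 3 users and 4 movies. Stored entries are 1 or -1, and every (user, movie) pair without an interaction in the train set reads as 0.

```python
import numpy as np
from scipy.sparse import csr_array

# Toy triplets (userId, movieId, rating); values are illustrative only.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
values = np.array([1, -1, 1, 1])

# Same constructor call as above, on a 3 x 4 toy problem.
R = csr_array((values, (rows, cols)), shape=(3, 4), dtype=np.int8)

# Densify only because the toy matrix is tiny.
dense = R.toarray()
print(dense)
print(R.nnz)  # number of stored interactions
```

Only the stored interactions take up memory, which is why the sparse format suits a matrix where most user and movie pairs have no rating.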

### Change triplet dataset into quadruplet

In [15]:
row,col = df_train.userId.values, df_train.movieId.values

In [16]:
edge_index = torch.stack([torch.tensor(row),torch.tensor(col)],dim=0)

In [17]:
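# Sample (userId, movieId) pairs that do not occur in the train set.
# Passing a tuple as num_nodes makes edge_index be interpreted as a bipartite graph.
neg = negative_sampling(edge_index,num_nodes=(n_users,n_movies),force_undirected=True)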

In [18]:
neg_sample = torch.empty_like(edge_index[0])

In [19]:
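# For every training interaction, pick at random one of the sampled negative movies of the same user.
# Note: np.random.choice raises an error if a user has no sampled negatives in `neg`.
for i in tqdm(range(edge_index[0].shape[0])):
    idx = np.random.choice(torch.where(neg[0]==edge_index[0][i])[0].numpy(),size=1)[0]
    neg_sample[i] = neg[1][idx]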

100%|████████████████████████████████████████████████████████████████████████| 800167/800167 [10:52<00:00, 1226.52it/s]
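The loop above takes about ten minutes because it scans all sampled negatives once per interaction. As a possible alternative, and not the method used in this notebook, the sketch below draws negatives with vectorized rejection sampling in NumPy: every row gets a random movie, and rows whose (user, movie) pair is already observed are redrawn. The function name `sample_negatives` is hypothetical, and the sketch assumes every user has at least one movie without an interaction, otherwise it would never terminate.

```python
import numpy as np

def sample_negatives(users, movies, n_movies, seed=0):
    """Draw, for each (user, movie) row, a movie the user has no interaction with."""
    rng = np.random.default_rng(seed)
    users = np.asarray(users, dtype=np.int64)
    movies = np.asarray(movies, dtype=np.int64)

    # Encode each observed (user, movie) pair as a single integer key.
    observed = np.unique(users * n_movies + movies)

    # Initial random guess for every row.
    negatives = rng.integers(0, n_movies, size=users.shape[0])

    # Redraw only the rows whose guess collides with an observed pair.
    pending = np.arange(users.shape[0])
    while pending.size > 0:
        keys = users[pending] * n_movies + negatives[pending]
        pending = pending[np.isin(keys, observed)]
        negatives[pending] = rng.integers(0, n_movies, size=pending.size)
    return negatives

# Toy interactions: 3 users, 5 movies (illustrative values only).
toy_users = np.array([0, 0, 1, 1, 2])
toy_movies = np.array([0, 1, 1, 2, 3])
toy_neg = sample_negatives(toy_users, toy_movies, n_movies=5)
print(toy_neg)
```

With the notebook's variables, the call would look like `sample_negatives(row, col, n_movies)`. Unlike the loop, this draws uniformly over all unseen movies of each user rather than from a pre-sampled pool.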


In [20]:
quadruplet = df_train.copy()

In [21]:
quadruplet

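| | userId | movieId | rating |
|---|---|---|---|
| 973180 | 5868 | 198 | 1 |
| 614205 | 3719 | 2961 | 1 |
| 987650 | 5960 | 3032 | 1 |
| 708993 | 4251 | 2651 | 1 |
| 294753 | 1751 | 1900 | 1 |
| ... | ... | ... | ... |
| 183330 | 1140 | 3321 | 1 |
| 484438 | 2981 | 796 | -1 |
| 158760 | 1014 | 521 | 1 |
| 919703 | 5554 | 157 | 1 |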


In [22]:
quadruplet['negativeId'] = neg_sample

In [23]:
quadruplet

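| | userId | movieId | rating | negativeId |
|---|---|---|---|---|
| 973180 | 5868 | 198 | 1 | 2554 |
| 614205 | 3719 | 2961 | 1 | 3170 |
| 987650 | 5960 | 3032 | 1 | 1066 |
| 708993 | 4251 | 2651 | 1 | 1165 |
| 294753 | 1751 | 1900 | 1 | 1801 |
| ... | ... | ... | ... | ... |
| 183330 | 1140 | 3321 | 1 | 573 |
| 484438 | 2981 | 796 | -1 | 2725 |
| 158760 | 1014 | 521 | 1 | 2933 |
| 919703 | 5554 | 157 | 1 | 315 |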


In [24]:
quadruplet.to_csv('quadruplet.csv',index=False)
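The saved files can be read back with `sparse.load_npz` and `pd.read_csv`. The sketch below shows the round trip on tiny made-up data written to a temporary directory, so it does not touch the real `R_train.npz` or `quadruplet.csv`. The toy matrix and frame are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse import csr_array

# Tiny stand-ins for the real artifacts (illustrative values only).
R = csr_array(np.array([[1, 0, -1], [0, 1, 0]], dtype=np.int8))
quad = pd.DataFrame({
    'userId': [0, 0, 1],
    'movieId': [0, 2, 1],
    'rating': [1, -1, 1],
    'negativeId': [1, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    r_path = os.path.join(tmp, 'R_train.npz')
    q_path = os.path.join(tmp, 'quadruplet.csv')

    # Save in the same formats as the notebook.
    sparse.save_npz(r_path, R)
    quad.to_csv(q_path, index=False)

    # Load both files back.
    R_loaded = sparse.load_npz(r_path)
    quad_loaded = pd.read_csv(q_path)

print(R_loaded.toarray())
print(quad_loaded)
```

Because the CSV files are written with `index=False`, the shuffled row index from the split is not stored, and the loaded frames get a fresh 0-based index.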