## Real data: MovieLens

In [1]:
import numpy as np
import torch
import matplotlib.pyplot as plt
from pathlib import Path
import seaborn as sns
%matplotlib inline

In [2]:
datDir = Path("./MovieLens/")

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
- Not every movie has a rating
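As a quick illustration of the record format above, here is a minimal sketch that parses one line into typed fields. The `Rating` tuple, the `parse_rating_line` helper and the sample line are hypothetical and only serve the example; the notebook itself loads the whole file with `np.genfromtxt`.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One record of ratings.dat (hypothetical container for this sketch)."""
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating_line(line: str) -> Rating:
    """Split a 'UserID::MovieID::Rating::Timestamp' record into integers."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Made-up sample line in the documented format (not read from the data file).
sample = "12::345::4::978300760\n"
parsed = parse_rating_line(sample)
print(parsed)
```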

User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"
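The age codes listed above can be turned into readable labels with a small lookup table. This is only a sketch: the names `AGE_GROUPS` and `age_label` are made up here and are not used elsewhere in the notebook.

```python
# Age codes and labels exactly as listed in the dataset description.
AGE_GROUPS = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+",
}


def age_label(code) -> str:
    """Map an age code (int or numeric string) to its range label."""
    return AGE_GROUPS[int(code)]


# users.dat is read as strings later on, so the code may arrive as "25".
print(age_label("25"))
```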

In [3]:
ratings = np.genfromtxt(datDir/"ratings.dat", delimiter="::")

In [4]:
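# np.int was removed in NumPy 1.24; the builtin int is equivalent here
# IDs are 1-based in the file, so subtract 1 to get 0-based indices
locs = ratings[:, 0:2].astype(int) - 1
scs = ratings[:, 2].astype(int)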

In [5]:
Ymat = torch.sparse_coo_tensor(locs.T, scs, size=(6040, 3952))

In [6]:
R = (Ymat.to_dense() != 0).float().to_sparse()
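The two cells above build the rating matrix and its observation mask. The following toy sketch does the same with a dense NumPy array instead of a torch sparse tensor, on made-up triplets. It relies on the assumption that every observed rating is at least 1 star, so a zero entry can only mean "not rated".

```python
import numpy as np

# Hypothetical toy ratings as (UserID, MovieID, Rating); IDs are 1-based.
toy = np.array([[1, 1, 5],
                [1, 3, 2],
                [2, 2, 4],
                [3, 3, 1]])
n_users, n_movies = 3, 3

# Shift the 1-based IDs to 0-based row/column indices.
rows = toy[:, 0] - 1
cols = toy[:, 1] - 1

# Y holds the ratings, with 0 wherever no rating was observed.
Y = np.zeros((n_users, n_movies), dtype=int)
Y[rows, cols] = toy[:, 2]

# R is the observation indicator: 1.0 where a rating exists, 0.0 otherwise.
R = (Y != 0).astype(float)
print(Y)
print(R)
```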

### Obtain Covariates

I choose one binary covariate from the user data and one from the movie data:

- Gender, 1 if female

- Genres, 1 if Drama

In [7]:
Users = np.genfromtxt(datDir/"users.dat", delimiter="::",dtype=str)

In [8]:
filMVs = datDir/"movies.dat"
with open(filMVs, "r", errors="ignore") as f:
    MVsRaw = f.readlines()
MVs = [MVRaw.split("::")[0:3:2] for MVRaw in MVsRaw]

In [9]:
# movie IDs in 1..3952 that are absent from movies.dat (no genre information)
mvIDs = [int(MV[0]) for MV in MVs]
fullMvIDs = list(range(1, 3953))
rmIds = np.setdiff1d(fullMvIDs, mvIDs)
rmIds

array([  91,  221,  323,  622,  646,  677,  686,  689,  740,  817,  883,
        995, 1048, 1072, 1074, 1182, 1195, 1229, 1239, 1338, 1402, 1403,
       1418, 1435, 1451, 1452, 1469, 1478, 1481, 1491, 1492, 1505, 1506,
       1512, 1521, 1530, 1536, 1540, 1560, 1576, 1607, 1618, 1634, 1637,
       1638, 1691, 1700, 1712, 1736, 1737, 1745, 1751, 1761, 1763, 1766,
       1775, 1778, 1786, 1790, 1800, 1802, 1803, 1808, 1813, 1818, 1823,
       1828, 1838, 3815])

In [10]:
# drop the columns of Y for those movie IDs (IDs are 1-based, columns 0-based)
Yarr = np.delete(Ymat.to_dense().cpu().numpy(), rmIds-1, axis=1)
Ymat = torch.tensor(Yarr).to_sparse()
R = (Ymat.to_dense() != 0).float().to_sparse()

In [13]:
print(f"The final data is of size {Ymat.shape}.")

The final data is of size torch.Size([6040, 3883]).
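The column-removal step can be checked on a small example. The sketch below uses made-up IDs and a made-up matrix; only `np.setdiff1d` and `np.delete` mirror the calls used above.

```python
import numpy as np

# Hypothetical setup: IDs run from 1 to 6, but only four have genre info.
present_ids = np.array([1, 2, 4, 6])
all_ids = np.arange(1, 7)

# IDs without genre info; setdiff1d returns them sorted.
missing_ids = np.setdiff1d(all_ids, present_ids)

# Toy rating matrix with one column per ID (2 users x 6 movies).
Y = np.arange(12).reshape(2, 6)

# Column j holds movie ID j + 1, hence the shift by one before deleting.
Y_kept = np.delete(Y, missing_ids - 1, axis=1)
print(missing_ids)
print(Y_kept)
```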


#### Count of movie types

In [11]:
MVsType = [MV[1].strip().split("|") for MV in MVs]
cTypes = set(list(np.concatenate(MVsType)))
for cType in cTypes:
    print(cType, len([1  for MVType in MVsType if cType in MVType]))

Documentary 127
Adventure 283
Animation 105
Sci-Fi 276
Comedy 1200
Crime 211
Children's 251
Fantasy 68
Musical 114
Mystery 106
Romance 471
War 143
Horror 343
Film-Noir 44
Western 68
Action 503
Thriller 492
Drama 1603


In [12]:
# if Drama, then 1
selType = "Drama"
mvXs = [selType in MV[1].strip().split("|") for MV in MVs]
mvXs = np.array(mvXs).astype(int)  # np.int was removed in NumPy 1.24
mvXs.mean()

0.4128251352047386

In [13]:
# if Female, then 1
userXs = np.array([user[1]=="F" for user in Users]).astype(int)  # np.int was removed in NumPy 1.24
userXs.mean()

0.28294701986754967

In [14]:
X1 = np.repeat(userXs.reshape(-1, 1), R.shape[1], axis=1)
X2 = np.repeat(mvXs.reshape(1, -1), R.shape[0], axis=0)
Xarr = np.array([X1, X2]).transpose((1, 2, 0))

In [15]:
X = torch.tensor(Xarr).to_sparse()
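To make the layout of the covariate tensor concrete, here is a toy sketch with three users and two movies (made-up values). Entry `[i, j]` holds the pair (covariate of user `i`, covariate of movie `j`). `np.stack(..., axis=-1)` is used as an equivalent of the `np.array([...]).transpose((1, 2, 0))` call above.

```python
import numpy as np

# Hypothetical binary covariates: 3 users and 2 movies.
user_x = np.array([1, 0, 0])
movie_x = np.array([0, 1])

# Repeat the user covariate along the movie axis, and vice versa.
X1 = np.repeat(user_x.reshape(-1, 1), movie_x.size, axis=1)   # shape (3, 2)
X2 = np.repeat(movie_x.reshape(1, -1), user_x.size, axis=0)   # shape (3, 2)

# Stack along a new last axis: X has shape (users, movies, 2).
X = np.stack([X1, X2], axis=-1)

# Same result as the transpose-based construction used in the notebook.
X_alt = np.array([X1, X2]).transpose((1, 2, 0))
print(X.shape)
print(X[0, 1])
```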

### Check that the dataset was built correctly

In [215]:
def MVID2idx(tMVID):
    if tMVID in rmIds:
        tMVidx = None
    else:
        tMVidx = tMVID - 1 - np.sum((rmIds - tMVID)< 0)
    return tMVidx
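The index mapping above subtracts one for the 1-based ID and one more for every removed ID smaller than the target. A toy sketch of the same idea, using `np.searchsorted` to count the smaller removed IDs (this assumes the removed IDs are sorted, which holds for the output of `np.setdiff1d`). The names `removed` and `movie_id_to_col` are hypothetical.

```python
import numpy as np

# Hypothetical sorted array of removed movie IDs.
removed = np.array([3, 5])


def movie_id_to_col(movie_id, removed_ids):
    """Return the 0-based column of movie_id after removal, or None if removed."""
    if movie_id in removed_ids:
        return None
    # searchsorted gives the number of removed IDs strictly below movie_id.
    n_removed_before = np.searchsorted(removed_ids, movie_id)
    return int(movie_id - 1 - n_removed_before)


# With IDs 1..6 and IDs 3 and 5 removed, the kept IDs 1, 2, 4, 6 map to 0..3.
cols = [movie_id_to_col(i, removed) for i in range(1, 7)]
print(cols)
```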

In [255]:
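# option 1: take a (user, movie) pair from an observed rating
tUID, tMVID = ratings[200000, :2].astype(int)
# option 2 (used below): override with a fixed pair that has no rating
tUID, tMVID = 3000, 3700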

In [256]:
tX1 = [int(User[1]=="F") for User in Users if int(User[0])==tUID][0]
tX2Tp = [MV[1] for MV in MVs if int(MV[0]) == tMVID]
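# match whole genre names, consistent with how mvXs was built
tX2 = int(selType in tX2Tp[0].strip().split("|"))
# tLoc == 2 marks the rows where both the user and the movie match
tLoc = (ratings[:, 0] == tUID).astype(int) +  (ratings[:, 1] == tMVID).astype(int)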
if np.sum(tLoc == 2)==0:
    tY = 0
    tR = 0
else:
    info = ratings[tLoc==2, :][0]
    tY = info[2]
    tR=1

In [257]:
if MVID2idx(tMVID) is None:
    tdX = None
    tdY = None
    tdR = None
else:
    tdX = X.to_dense()[tUID-1, MVID2idx(tMVID)]
    tdY = Ymat.to_dense()[tUID-1, MVID2idx(tMVID)]
    tdR = R.to_dense()[tUID-1, MVID2idx(tMVID)]

In [258]:
# results from the original dataset
print(
    f"Under such case, \n"
    f"the X1 is {tX1}, \n"
    f"the X2 is {tX2} with type {tX2Tp},\n"
    f"the Y is {tY} and R is {tR}"
     )

Under such case, 
the X1 is 0, 
the X2 is 1 with type ['Drama|Sci-Fi\n'],
the Y is 0 and R is 0


In [259]:
# results from the cleaned dataset
print(
    f"Under such case in the data, \n"
    f"the Xs are {tdX.cpu().numpy()}, \n"
    f"the Y is {tdY} and R is {tdR}"
     )

Under such case in the data, 
the Xs are [0 1], 
the Y is 0 and R is 0.0


### Save the results

In [18]:
import pickle

In [20]:
with open(datDir/"mvlensX.pkl", "wb") as f:
    pickle.dump(X, f)

In [21]:
with open(datDir/"mvlensY.pkl", "wb") as f:
    pickle.dump(Ymat, f)

In [23]:
with open(datDir/"mvlensR.pkl", "wb") as f:
    pickle.dump(R, f)
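To read the saved objects back, a loader along the following lines should be enough. This is a sketch: `load_pickle` is a hypothetical helper, and the round trip below uses a small stand-in dictionary in a temporary directory rather than the real tensors. Loading the actual files is assumed to happen in an environment where torch is installed.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip on a stand-in object (not the real X, Ymat or R).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.pkl"
    obj = {"shape": (6040, 3883)}
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    restored = load_pickle(path)

print(restored)
```

With the real files, the call would look like `X = load_pickle(datDir/"mvlensX.pkl")`.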