# Gowalla Dataset

The [Gowalla dataset](https://snap.stanford.edu/data/loc-gowalla.html) contains user-venue check-ins. This notebook pre-processes the full dataset and splits it into non-overlapping training, validation, and test sets. The data is used in the paper ["Modeling User Exposure in Recommendation"](http://arxiv.org/abs/1510.07025).

In [1]:
import json
import os

import numpy as np
import pandas as pd

Set `DATA_DIR` to the directory where you keep the [processed data](http://dawenl.github.io/data/gowalla_pro.zip).

In [2]:
DATA_DIR = '/home/waldorf/dawen.liang/gowalla_pro'

In [3]:
df = pd.read_table(os.path.join(DATA_DIR, 'gwl_checkins.tsv'), header=None, sep='\t', names=['uid', 'sid', 'rating'])

In [4]:
df

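|   | uid | sid | rating |
|---|-----|-----|--------|
| 0 | 0 | 22847 | 1 |
| 1 | 0 | 420315 | 1 |
| 2 | 0 | 316637 | 1 |
| 3 | 0 | 16516 | 1 |
| 4 | 0 | 5535878 | 1 |
| 5 | 0 | 15372 | 1 |
| 6 | 0 | 21714 | 1 |
| 7 | 0 | 420315 | 1 |
| 8 | 0 | 153505 | 1 |
| 9 | 0 | 420315 | 1 |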


In [5]:
def get_count(df, col):
    # Number of check-in records for each distinct value of `col`,
    # returned as a Series indexed by that value.
    return df.groupby(col).size()

def filter_triplets(df, min_sc=20):
    # Only keep the triplets for venues with at least min_sc check-in records.
    # Repeated check-ins by the same user each count as one record.
    songcount = get_count(df, 'sid')
    df = df[df['sid'].isin(songcount.index[songcount >= min_sc])]

    # Update both usercount and songcount after filtering
    usercount, songcount = get_count(df, 'uid'), get_count(df, 'sid')
    return df, usercount, songcount

In [6]:
df, usercount, songcount = filter_triplets(df)
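
The filtering step can be tried on a toy frame. The sketch below uses made-up data and a hypothetical threshold of 2 (the notebook uses 20); it only illustrates the count-then-`isin` pattern and is not the notebook's actual output. Note that the count is over check-in records, so a user who checks in at the same venue several times contributes several records.

```python
import pandas as pd

# Toy check-in log (hypothetical data): venue 10 has 3 records, venue 20 has 1.
toy = pd.DataFrame({
    'uid':    [0, 1, 2, 0],
    'sid':    [10, 10, 10, 20],
    'rating': [1, 1, 1, 1],
})

# Count records per venue; the result is a Series indexed by venue id.
venue_count = toy.groupby('sid').size()

# Keep only rows whose venue has at least min_sc records.
min_sc = 2  # illustrative threshold, not the notebook's default
kept = toy[toy['sid'].isin(venue_count.index[venue_count >= min_sc])]

print(kept['sid'].unique().tolist())
```

Only the rows for venue 10 survive, because venue 20 falls below the threshold.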

In [7]:
sparsity_level = float(df.shape[0]) / (usercount.shape[0] * songcount.shape[0])
print("After filtering, there are %d triplets from %d users and %d venues (sparsity level %.3f%%)" % (
    df.shape[0], usercount.shape[0], songcount.shape[0], sparsity_level * 100))

After filtering, there are 2318616 triplets from 57629 users and 47198 venues (sparsity level 0.085%)
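
As a quick sanity check, the reported percentage can be reproduced from the three counts in the output above. This is plain arithmetic on those printed figures. The quantity is the fraction of the user-by-venue matrix that would be filled if every record were a distinct (user, venue) pair; since the sample rows show repeated check-ins, the fraction of distinct filled cells is presumably somewhat lower.

```python
# Figures copied from the printed output above.
n_triplets = 2318616
n_users = 57629
n_venues = 47198

# Records divided by the number of cells in the user-by-venue matrix.
density = float(n_triplets) / (n_users * n_venues)
print("%.3f%%" % (density * 100))
```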


In [8]:
unique_uid = sorted(pd.unique(df['uid']))
unique_sid = sorted(pd.unique(df['sid']))

In [9]:
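# Cast keys to plain int so that json.dump accepts them (NumPy integers are rejected on Python 3)
uid2idx = dict((int(uid), idx) for (idx, uid) in enumerate(unique_uid))
sid2idx = dict((int(sid), idx) for (idx, sid) in enumerate(unique_sid))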

In [10]:
with open(os.path.join(DATA_DIR, 'sid2idx.json'), 'w') as f:
    json.dump(sid2idx, f)

In [11]:
with open(os.path.join(DATA_DIR, 'uid2idx.json'), 'w') as f:
    json.dump(uid2idx, f)
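
One detail worth keeping in mind when these mapping files are read back later: JSON object keys are always strings, so integer ids come back as `str`. The sketch below uses hypothetical ids and the name `id2idx` purely for illustration; it shows the round trip and one way to restore integer keys.

```python
import json

# Hypothetical raw ids mapped to contiguous indices, as in the cells above.
raw_ids = [7, 42, 1000]
id2idx = {raw: idx for idx, raw in enumerate(sorted(raw_ids))}

# Round-trip through JSON: the integer keys are serialized as strings.
restored = json.loads(json.dumps(id2idx))

# Convert the keys back to int to recover the original mapping.
restored_int = {int(raw): idx for raw, idx in restored.items()}

print(restored_int == id2idx)
```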

In [12]:
with open(os.path.join(DATA_DIR, 'unique_uid.txt'), 'w') as f:
    for uid in unique_uid:
        f.write('%s\n' % uid)

In [13]:
with open(os.path.join(DATA_DIR, 'unique_sid.txt'), 'w') as f:
    for sid in unique_sid:
        f.write('%s\n' % sid)

## Generate train/test/vad sets

Hold out 20% of the check-ins as the test set.

In [14]:
np.random.seed(12345)
n_ratings = df.shape[0]
test = np.random.choice(n_ratings, size=int(0.20 * n_ratings), replace=False)

In [15]:
test_idx = np.zeros(n_ratings, dtype=bool)
test_idx[test] = True

test_df = df[test_idx]
train_df = df[~test_idx]
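
The split above draws row positions without replacement and turns them into a boolean mask. A minimal sketch of the same idea on ten rows follows; the local `RandomState` generator and the toy size are assumptions made for illustration, whereas the notebook seeds the global generator.

```python
import numpy as np

n_rows = 10
rng = np.random.RandomState(12345)  # local generator, used here for isolation

# Draw 20% of the row positions without replacement.
held_out = rng.choice(n_rows, size=int(0.20 * n_rows), replace=False)

# Turn the positions into a boolean mask; ~mask selects the complement.
mask = np.zeros(n_rows, dtype=bool)
mask[held_out] = True

rows = np.arange(n_rows)
test_rows, train_rows = rows[mask], rows[~mask]
print(len(test_rows), len(train_rows))
```

Because every row falls on exactly one side of the mask, the two parts are disjoint and together cover all rows.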

Make sure that no user (row) or venue (column) is left empty in the training data.

In [17]:
print("There are total of %d unique users in the training set and %d unique users in the entire dataset" %
      (len(pd.unique(train_df['uid'])), len(pd.unique(df['uid']))))

There are total of 57095 unique users in the training set and 57629 unique users in the entire dataset


In [18]:
print("There are total of %d unique items in the training set and %d unique items in the entire dataset" %
      (len(pd.unique(train_df['sid'])), len(pd.unique(df['sid']))))

There are total of 47198 unique items in the training set and 47198 unique items in the entire dataset


Some users have no check-ins left in the training set (only 57095 of 57629 users remain), so we move all of those users' check-ins from the test set back into the training set.

In [19]:
train_uid = set(pd.unique(train_df['uid']))

In [20]:
# Users that appear in the full dataset but not in the training set
left_uid = [uid for uid in pd.unique(df['uid']) if uid not in train_uid]

In [21]:
move_idx = test_df['uid'].isin(left_uid)

In [22]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
train_df = pd.concat([train_df, test_df[move_idx]])
test_df = test_df[~move_idx]

In [23]:
# make sure we are good
print("There are total of %d unique users in the training set and %d unique users in the entire dataset" %
      (len(pd.unique(train_df['uid'])), len(pd.unique(df['uid']))))

There are total of 57629 unique users in the training set and 57629 unique users in the entire dataset
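
The move-back step can be condensed into a single mask. The sketch below uses hypothetical toy frames in which one user appears only in the test set; it is an equivalent formulation for illustration, not the notebook's exact code.

```python
import pandas as pd

# Hypothetical split in which user 2 ended up only in the test set.
train = pd.DataFrame({'uid': [0, 0, 1], 'sid': [5, 6, 5]})
test = pd.DataFrame({'uid': [1, 2, 2], 'sid': [6, 5, 7]})

# Rows of users that never appear in training.
missing = ~test['uid'].isin(train['uid'])

# Move those rows into training and drop them from test.
train = pd.concat([train, test[missing]])
test = test[~missing]

print(sorted(train['uid'].unique().tolist()), test['uid'].tolist())
```

A side effect of this step is that the held-out share ends up slightly below the nominal 20%, since rows only ever move from the held-out set into training.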


Hold out 10% of the remaining training check-ins as the validation set.

In [24]:
np.random.seed(13579)
n_ratings = train_df.shape[0]
vad = np.random.choice(n_ratings, size=int(0.10 * n_ratings), replace=False)

In [25]:
vad_idx = np.zeros(n_ratings, dtype=bool)
vad_idx[vad] = True

vad_df = train_df[vad_idx]
train_df = train_df[~vad_idx]

Again, make sure that no user (row) or venue (column) is left empty in the training data.

In [26]:
print("There are total of %d unique users in the training set and %d unique users in the entire dataset" %
      (len(pd.unique(train_df['uid'])), len(pd.unique(df['uid']))))

There are total of 57294 unique users in the training set and 57629 unique users in the entire dataset


In [27]:
print("There are total of %d unique items in the training set and %d unique items in the entire dataset" %
      (len(pd.unique(train_df['sid'])), len(pd.unique(df['sid']))))

There are total of 47198 unique items in the training set and 47198 unique items in the entire dataset


Again, some users have no check-ins left in the training set (only 57294 of 57629 users remain), so we move all of those users' check-ins from the validation set back into the training set.

In [28]:
train_uid = set(pd.unique(train_df['uid']))

In [29]:
# Users that appear in the full dataset but not in the training set
left_uid = [uid for uid in pd.unique(df['uid']) if uid not in train_uid]

In [30]:
move_idx = vad_df['uid'].isin(left_uid)

In [31]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
train_df = pd.concat([train_df, vad_df[move_idx]])
vad_df = vad_df[~move_idx]

In [32]:
print("%s %s" % (train_df.shape, vad_df.shape))

(1670363, 3) (185185, 3)


In [33]:
# make sure we are good
print("There are total of %d unique users in the training set and %d unique users in the entire dataset" %
      (len(pd.unique(train_df['uid'])), len(pd.unique(df['uid']))))

There are total of 57629 unique users in the training set and 57629 unique users in the entire dataset


## Numerize the data into (user_index, item_index, count) format

In [34]:
# Build lists rather than lazy map objects so the assignment also works on Python 3
uid = [uid2idx[x] for x in train_df['uid']]
sid = [sid2idx[x] for x in train_df['sid']]

In [35]:
train_df['uid'] = uid
train_df['sid'] = sid

In [36]:
train_df.to_csv(os.path.join(DATA_DIR, 'train.num.csv'), index=False)

In [37]:
# Build lists rather than lazy map objects so the assignment also works on Python 3
uid = [uid2idx[x] for x in test_df['uid']]
sid = [sid2idx[x] for x in test_df['sid']]

In [38]:
test_df['uid'] = uid
test_df['sid'] = sid

In [39]:
test_df.to_csv(os.path.join(DATA_DIR, 'test.num.csv'), index=False)

In [40]:
# Build lists rather than lazy map objects so the assignment also works on Python 3
uid = [uid2idx[x] for x in vad_df['uid']]
sid = [sid2idx[x] for x in vad_df['sid']]

In [41]:
vad_df['uid'] = uid
vad_df['sid'] = sid

In [42]:
vad_df.to_csv(os.path.join(DATA_DIR, 'vad.num.csv'), index=False)
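
The three numerize-and-save steps above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; the function name `numerize`, the toy mappings, and the temporary output path are all hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def numerize(frame, uid2idx, sid2idx):
    """Return a copy of frame with raw ids replaced by contiguous indices."""
    out = frame.copy()  # copy so that a slice of another frame is not modified
    out['uid'] = out['uid'].map(uid2idx)
    out['sid'] = out['sid'].map(sid2idx)
    return out


# Hypothetical mappings and data for illustration.
uid2idx = {10: 0, 20: 1}
sid2idx = {7: 0, 8: 1, 9: 2}
toy = pd.DataFrame({'uid': [10, 20, 20], 'sid': [9, 7, 8], 'rating': [1, 1, 1]})

# Write to a temporary directory and read the file back.
out_path = os.path.join(tempfile.mkdtemp(), 'toy.num.csv')
numerize(toy, uid2idx, sid2idx).to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded['uid'].tolist(), reloaded['sid'].tolist())
```

One design difference to be aware of: `Series.map` with a dictionary yields `NaN` for ids missing from the mapping, whereas the explicit dictionary lookup used in the cells above raises a `KeyError`. In this notebook every id is in the mapping by construction, so the two behave the same.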