## Intro to the datasets

**EDIT: I removed the GitHub dataset from this notebook because its description in the documentation was confusing.**

In the YouTube dataset, each line contains a user identifier followed by a group identifier (separated by whitespace), meaning that the user is a member of the group.

**user-num**: 94238

**group-num**: 30087


## Randomly select 20000 users

As proposed in the BiNE paper, we should extract 15000 users and keep all of their edges as our full dataset. Since this dataset is sparser than Wikipedia, we select 20000 users here.
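The selection has to be done without replacement, otherwise the same user could be drawn twice and fewer than 20000 distinct users would remain. A minimal sketch of that idea using only the standard library follows. The function name `sample_user_ids` is hypothetical, and because it uses a different generator it will not reproduce the ids picked by the NumPy cells in this notebook.

```python
import random

NUM_USERS = 94238    # user count reported above for the YouTube dataset
SAMPLE_SIZE = 20000  # number of users to keep


def sample_user_ids(num_users, sample_size, seed):
    """Draw `sample_size` distinct user ids from 1..num_users."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no id appears twice.
    return set(rng.sample(range(1, num_users + 1), sample_size))


selected = sample_user_ids(NUM_USERS, SAMPLE_SIZE, seed=111)
print(len(selected))  # 20000
```

Returning a set also makes the later membership test (`int(user) in selected`) cheap.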

Number of edges after selection:

* **YouTube**: 62098

We also rename all user identifiers by prefixing them with the letter 'u', and all group identifiers with the letter 'i', so that the two node types can be told apart.

Note that the YouTube dataset file has one extra line at the top, so we use `next` to skip it.

In [1]:
import numpy as np

In [37]:
a = np.arange(1,94239)
b = np.arange(1, 120868)

In [47]:
rs1 = np.random.RandomState(111)
rs2 = np.random.RandomState(222)
#users_y = rs1.randint(low=1, high=94238, size = 20000)
#users_g = rs2.randint(low=1, high=120867, size = 20000)
rs1.shuffle(a)
rs2.shuffle(b)
users_y = a[:20000]
users_g = b[:20000]

In [48]:
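# a set gives O(1) membership tests; `in` on a NumPy array scans the whole array
selected_users = set(users_y.tolist())
with open('out.youtube-groupmemberships', 'r') as fin, open('youtube_selected.dat', 'w') as fout:
    next(fin)  # skip the extra first line
    line_num = 0
    for line in fin:
        user, group = line.strip().split(' ')
        if int(user) in selected_users:
            line_num += 1
            fout.write('u' + user + ' ' + 'i' + group + '\n')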

line_num

62098
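The filtering and renaming step can also be written as a small generator that works on any iterable of lines, which makes it easy to try on a few rows before touching the real file. This is only a sketch: `filter_and_rename` is a hypothetical helper, and it splits on any whitespace so that it works whether the file uses tabs or spaces.

```python
def filter_and_rename(lines, selected_users):
    """Yield 'u<user> i<group>' for every edge whose user id is selected."""
    for line in lines:
        fields = line.split()  # splits on tabs and spaces alike
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user, group = fields[0], fields[1]
        if int(user) in selected_users:
            yield 'u{} i{}'.format(user, group)


sample_lines = ['5\t20\n', '5 21\n', '7 30\n', '\n']
selected = {5, 11}
edges = list(filter_and_rename(sample_lines, selected))
print(edges)  # ['u5 i20', 'u5 i21']
```

With a real file, the caller would still skip the extra first line with `next(fin)` before passing `fin` to the helper.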

## Split training and test set

As proposed in BiNE, we take 60% of each user's edges as the training set and the remaining 40% as the positive samples of the test set. For the test set, we also need to sample an equal number of negative samples, i.e. edges that do not exist.

### First we calculate each user's edge number


In [36]:
def user_cnt(filename):
    """Return the number of edges of each user, in file order.

    Assumes that all lines belonging to one user are contiguous.
    """
    cnt = []
    cur = None  # user whose edges are currently being counted
    i = 0
    with open(filename, 'r') as fin:
        # skip the manually added (user, item) heading
        next(fin)
        for line in fin:
            user, group = line.strip().split(' ')
            if cur is not None and user != cur:
                cnt.append(i)  # the previous user is finished
                i = 0
            cur = user
            i += 1

    # the last user (nothing to append if the file had no edges)
    if cur is not None:
        cnt.append(i)

    return cnt

In [37]:
youtube_cnt = user_cnt('youtube_selected.dat')

In [38]:
len(youtube_cnt)

20000
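As a cross-check on the counting step, the same per-user counts can be obtained with `collections.Counter`, which does not rely on the lines of a user being contiguous. This is a sketch with a hypothetical helper name, and it assumes the heading line has already been skipped by the caller.

```python
from collections import Counter


def edges_per_user(lines):
    """Count how many edges each user has in 'user item' lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[0]] += 1
    return counts


sample_lines = ['u5 i20', 'u5 i21', 'u5 i22', 'u11 i52', 'u13 i54', 'u13 i55']
counts = edges_per_user(sample_lines)
print(list(counts.values()))  # [3, 1, 2]
```

Since a `Counter` keeps keys in first-seen order (Python 3.7+), the values line up with the list returned by `user_cnt` when the file is grouped by user.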

### Then for each user, split 60% for Training set

Before using pandas `read_table`, I manually added the following line at the top of the file:
```
user item
```
to serve as the header.

In [8]:
import pandas as pd

In [9]:
f = pd.read_table('youtube_selected.dat', sep = ' ')

In [13]:
f

| | user | item |
|---|---|---|
| 0 | u5 | i20 |
| 1 | u5 | i21 |
| 2 | u5 | i22 |
| 3 | u11 | i52 |
| 4 | u13 | i54 |
| 5 | u13 | i55 |
| 6 | u13 | i56 |
| 7 | u19 | i71 |
| 8 | u19 | i72 |
| 9 | u19 | i73 |


In [26]:
start, end = 0, 0
for i in youtube_cnt:
    # rows [start, end) hold the edges of the current user
    start = end
    end = start + i
    cur = f.iloc[start:end, :]
    train = cur.sample(frac=0.6)
    test = cur.drop(train.index)  # the rows that were not sampled for training
    # mode='a' appends, so delete both output files before re-running this cell
    train.to_csv('youtube_train.dat', header=None, index=None, mode='a')
    test.to_csv('youtube_test.dat', header=None, index=None, mode='a')
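The per-user split can also be sketched without pandas, which makes the rounding behaviour easy to inspect. The helper `split_per_user` below is hypothetical. It assumes each user's edges are contiguous, and it uses Python's `round` to decide how many edges go to training, which is my assumption about how to handle fractional counts.

```python
import random
from itertools import groupby


def split_per_user(edges, train_frac=0.6, seed=0):
    """Split (user, item) pairs into train and test lists, user by user."""
    rng = random.Random(seed)
    train, test = [], []
    # groupby only merges adjacent rows, hence the contiguity assumption
    for _, user_edges in groupby(edges, key=lambda e: e[0]):
        user_edges = list(user_edges)
        n_train = round(train_frac * len(user_edges))
        rng.shuffle(user_edges)
        train.extend(user_edges[:n_train])
        test.extend(user_edges[n_train:])
    return train, test


edges = [('u5', 'i20'), ('u5', 'i21'), ('u5', 'i22'),
         ('u11', 'i52'),
         ('u13', 'i54'), ('u13', 'i55'), ('u13', 'i56')]
train, test = split_per_user(edges)
print(len(train), len(test))  # 5 2
```

Under this rounding rule, a user with a single edge contributes that edge to the training set and nothing to the test set, which is one reason the overall train share can end up above 60%.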

## Add negative samples

Now we have a train set and a test set with only positive samples. The positive samples can be used to train LINE or our approach, i.e. B-LINE. However, in order to train a logistic regression (LR) classifier, we need to add negative samples. The number of negative samples should equal the total number of positive samples in the train and test sets.

In [1]:
import networkx as nx
from networkx.algorithms import bipartite

In [2]:
g = nx.Graph()

In [3]:
with open('youtube_selected.dat', 'r') as fin:
    next(fin)
    for line in fin:
        user, group = line.strip().split(' ')
        g.add_node(user, bipartite = 0)
        g.add_node(group, bipartite = 1)
        g.add_edge(user, group)

In [4]:
top_nodes = {n for n, d in g.nodes(data=True) if d['bipartite']==0}

In [5]:
g.number_of_edges()

62098

In [6]:
len(top_nodes)

20000

In [7]:
bottom_nodes = set(g) - top_nodes

### Then we put the nodes into lists so that integer indices map to nodes

In [8]:
import numpy as np

In [9]:
users = list(top_nodes)
groups = list(bottom_nodes)

In [10]:
num = 41108 + 20990
num # the number we need

62098

In [32]:
np.ceil(num*2*0.6 - 41108)

33410.0
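The arithmetic behind the 33410 figure can be spelled out step by step. The sketch below assumes that 41108 and 20990 are the numbers of positive edges in `youtube_train.dat` and `youtube_test.dat` respectively, which is how I read the cells above.

```python
import math

pos_train = 41108  # positive edges in the train file
pos_test = 20990   # positive edges in the test file (assumed, see lead-in)
num_neg = pos_train + pos_test        # negatives to sample: 62098

total = 2 * num_neg                   # positives plus negatives: 124196
train_size = math.ceil(total * 0.6)   # 60% of all samples: 74518
neg_train = train_size - pos_train    # negatives that go to train: 33410
neg_test = num_neg - neg_train        # negatives left for test: 28688

print(neg_train, neg_test)  # 33410 28688
```

Under these numbers the final test set holds 20990 positives and 28688 negatives, so its two classes are not exactly balanced.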

In [24]:
neg = []
rs = np.random.RandomState(427)
user_ids = rs.randint(len(users), size = num)

for user_id in user_ids:
    user = users[user_id]
    # draw groups from the seeded generator until (user, group) is not an edge
    group = groups[rs.randint(len(groups))]
    while g.has_edge(user, group):
        group = groups[rs.randint(len(groups))]

    neg.append((user, group, 0))
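The rejection-sampling idea used above can be sketched without networkx by keeping the existing edges in a set. This variant is not the notebook's exact procedure: it rejects whole (user, group) pairs and also discards repeated pairs, so every negative sample is distinct. The name `sample_negatives` is hypothetical, and the size guard assumes `edges` is a set of pairs drawn from `users` and `groups`.

```python
import random


def sample_negatives(users, groups, edges, num, seed):
    """Draw `num` distinct (user, group, 0) triples that are not in `edges`."""
    max_neg = len(users) * len(groups) - len(edges)
    if num > max_neg:
        raise ValueError('only {} non-edges exist'.format(max_neg))
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num:
        pair = (rng.choice(users), rng.choice(groups))
        if pair not in edges:
            negatives.add(pair)  # the set silently drops repeated pairs
    # sort so the result does not depend on set iteration order
    return [(u, g, 0) for u, g in sorted(negatives)]


users = ['u1', 'u2', 'u3']
groups = ['i1', 'i2']
edges = {('u1', 'i1'), ('u2', 'i2')}
neg = sample_negatives(users, groups, edges, num=3, seed=427)
print(neg)
```

Sorting makes the output reproducible for a given seed, but the list should still be shuffled before it is split between train and test.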

In [26]:
pos = []
with open('youtube_train.dat', 'r') as fin:
    for line in fin:
        user, group = line.strip().split(',')
        pos.append((user, group, 1))

In [33]:
neg_train = neg[:33410]
neg_test = neg[33410:]

In [34]:
len(pos)

41108

In [35]:
pos.extend(neg_train)

In [38]:
np.random.shuffle(pos)

In [40]:
user, group, tag = pos[0]

In [41]:
user

'u18171'

In [43]:
with open('case_train.dat', 'w') as fout:
    for ent in pos:
        user, group, tag = ent
        fout.write('{} {} {}\n'.format(user, group, tag))

In [44]:
pos_test = []
with open('youtube_test.dat', 'r') as fin:
    for line in fin:
        user, group = line.strip().split(',')
        pos_test.append((user,group,1))

In [46]:
pos_test.extend(neg_test)

In [48]:
np.random.shuffle(pos_test)

In [50]:
with open('case_test.dat', 'w') as fout:
    for ent in pos_test:
        user, group, tag = ent
        fout.write('{} {} {}\n'.format(user, group, tag))