<center><h1>TLDA</h1></center>

<img src="tlda.png">

1. `theta_u` : pattern distribution of user *u*.
2. `phi_t` : pattern distribution of time *t*.
3. `psi_z` : venue category distribution of pattern *z*

Analogous to the traditional LDA : `document` - `user/time`, `topic` - `cultural pattern`, and `word` - `venue`.

<h2>Data Extraction</h2>

1. `checkin_data` : {*user1* : [(*venue_category_1*, *time_1*), (*venue_category_2*, *time_2*),..., (*venue_category_n*, *time_n*)],*user2* : [(*venue_category_1*, *time_1*), (*venue_category_2*, *time_2*) ...], *user3*...}.
2. `venues` : list of venue ids.
3. `venue_categories` : list of venue categories.


In [57]:
from collections import defaultdict
import numpy as np

In [360]:
# Month(str): Jan Feb [Mar not included] Apr May Jun Jul Aug Sep Oct Nov Dec
# day: Mon, Tue, Wed, Thu, Fri, Sat, Sun
# Hour(int): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23


# data format: (user1 (venue_category_1, time_1), (venue_category_2, time_2), (venue_category_3, time_3))
# time format: month-day-hour; Example: (Oct-Wed-13)
checkin_data = defaultdict(list)

# store all venues
venues = set()

with open("nyc.txt") as nyc:
    for checkin in nyc:
        fields = checkin.split('\t')

        # user id, venue id and venue category of this check-in
        user, venue, venue_category = fields[0], fields[1], fields[3]
        venues.add(venue)

        # the last field is the timestamp; tokens 0, 1 and 3 hold day of week, month and clock time
        checkin_time = fields[-1].split(' ')

        # time in the correct format: month-day-hour
        time = checkin_time[1] + '-' + checkin_time[0] + '-' + checkin_time[3].split(':')[0]

        # one check-in tuple of this user
        checkin_data[user].append((venue_category, time))
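
To make the parsing step concrete, here is a minimal sketch applied to a single made-up line. The field layout (tab-separated, venue category in column 3, a timestamp such as `Tue Apr 03 18:00:09 +0000 2012` in the last column) is inferred from the indexing in the cell above. The sample values and the helper name `parse_checkin` are illustrative assumptions, not taken from `nyc.txt`.

```python
def parse_checkin(line):
    """Return (user, venue, (venue_category, time)) for one tab-separated line."""
    fields = line.rstrip('\n').split('\t')
    # Timestamp tokens, e.g. ['Tue', 'Apr', '03', '18:00:09', '+0000', '2012']
    day, month, _, clock = fields[-1].split(' ')[:4]
    time = '-'.join([month, day, clock.split(':')[0]])
    return fields[0], fields[1], (fields[3], time)


# Hypothetical line; the real file may carry different values and columns.
sample = '\t'.join(['470', 'venue_abc', 'cat_123', 'Arts & Crafts Store',
                    '40.7', '-74.0', '-240', 'Tue Apr 03 18:00:09 +0000 2012'])
user, venue, checkin = parse_checkin(sample)
print(user, venue, checkin)
```

The resulting tuple has the same `(venue_category, 'month-day-hour')` shape that `checkin_data` stores for each user.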

In [59]:
print("Number of users: {}".format(len(checkin_data)))

venue_categories = set()
for checkins in checkin_data.values():
    for checkin in checkins:
        venue_categories.add(checkin[0])
print("Number of venue categories: {}".format(len(venue_categories)))

print("Number of venues: {}".format(len(venues)))

total_checkins = sum(len(checkins) for checkins in checkin_data.values())
print("Number of checkins: {}".format(total_checkins))

# use a list and the built-ins instead of variables named max and min, which would shadow them
checkins_per_user = [len(checkins) for checkins in checkin_data.values()]
print("Maximum number of check-ins per user: {}".format(max(checkins_per_user)))
print("Minimum number of check-ins per user: {}".format(min(checkins_per_user)))

distinct_checkins = set()
for checkins in checkin_data.values():
    for checkin in checkins:
        distinct_checkins.add(checkin)
print("Number of distinct check-ins: {}".format(len(distinct_checkins)))

Number of users: 1083
Number of venue categories: 251
Number of venues: 38333
Number of checkins: 227428
Maximum number of check-ins per user: 2697
Minimum number of check-ins per user: 100
Number of distinct check-ins: 81320


<h2> Basic info </h2>

1. `1083` users, with id from `1` to `1083`. 
2. `38333` venues.
3. `251` venue categories.
4. `227428` check-ins.
5. Maximum number of check-ins per user: `2697`
6. Minimum number of check-ins per user: `100`
7. Number of distinct check-ins: `81320`


<h2>What the author did</h2>

1. First, filter cultural fans by keeping only users with at least 20 check-ins.
2. Besides a venue category label, attach a temporal label to each cultural check-in with three levels of identifiers: month of year (Oct), day of week (Fri), and hour of day (13). In this form, a user’s check-in history can be represented as, for example, `(User3, ((Concert hall, JulFri20), (Golf, OctSun10), (Yoga, AprFri18)))`.
3. Run the TLDA model with the optimum number of patterns *K* given by TCV: take the 7 values from 3 to 9 as candidates, run TLDA for 100 iterations each, compute the average TCV score of each candidate, and select the *K* with the highest score.
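
The selection of *K* in step 3 amounts to a loop over the candidate values. The sketch below shows only that loop: the callable `average_score` and the toy scores are hypothetical placeholders standing in for the average TCV score, which is not implemented here.

```python
def select_k(candidates, average_score):
    """Return the candidate K with the highest average score, plus all scores."""
    scores = {k: average_score(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy scores, made up purely to exercise the selection logic.
toy_scores = {3: 0.41, 4: 0.47, 5: 0.52, 6: 0.50, 7: 0.44, 8: 0.40, 9: 0.38}
best_k, scores = select_k(range(3, 10), toy_scores.get)
print(best_k)
```

In a real run, `average_score` would fit the model with `k` patterns and return its averaged evaluation score.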

<h2> What we did with respect to points above</h2>

1. For our data set, the minimum number of check-ins per user is 100 (the maximum is 2697), so there is no need to trim the data.
2. We followed the author's approach. However, based on the heatmap, the cultural pattern with respect to day of week is not significant, so we might change the time format to `(month-hour)` instead of `(month-day-hour)`.
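
If the time format is changed, the existing labels can be converted without re-reading the file. This is a minimal sketch, assuming labels follow the `month-day-hour` format built during data extraction; the helper name `drop_day` and the toy check-ins are illustrative.

```python
def drop_day(time_label):
    """Convert 'month-day-hour' (e.g. 'Oct-Wed-13') to 'month-hour' ('Oct-13')."""
    month, _day, hour = time_label.split('-')
    return month + '-' + hour


toy_checkins = [('Concert Hall', 'Jul-Fri-20'), ('Yoga Studio', 'Apr-Fri-18')]
converted = [(category, drop_day(t)) for category, t in toy_checkins]
print(converted)
```

Dropping the day merges check-ins that differ only in day of week, so the vocabulary of distinct check-ins can only shrink.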

<h2> My approach </h2>

The input to the `lda` library has to be a document-term matrix `X`, where `X_{ij}` is the number of times the term at index `j` appears in document `i`. <br>
There are 1083 users (1083 "documents") and 81320 distinct check-ins (81320 "words"), so we construct a 1083 by 81320 matrix as the input. <br>
Because the users in `checkin_data` are not ordered by their ids, we have to create a user_id-index mapping. Similarly, we also create a checkin-index mapping. <br>
After running LDA, we need to map indices back to users and check-ins, so we also need an index-user_id mapping and an index-checkin mapping, which are exactly the inverses of the two mappings above.
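
The idea can be illustrated on a toy data set before applying it to the real one. This is a minimal sketch with made-up users and check-ins; the variable names are chosen for the example and differ from those used in the cells below.

```python
import numpy as np

# Two hypothetical users; 'u7' visits the same (category, time) twice.
toy_data = {
    'u7': [('Bar', 'Oct-Fri-22'), ('Gym', 'Oct-Mon-07'), ('Bar', 'Oct-Fri-22')],
    'u2': [('Gym', 'Oct-Mon-07')],
}

# Forward and inverse mappings for users (documents).
user_to_index = {user: i for i, user in enumerate(toy_data)}
index_to_user = {i: user for user, i in user_to_index.items()}

# Forward and inverse mappings for distinct check-ins (words).
checkin_to_index = {}
for checkins in toy_data.values():
    for checkin in checkins:
        checkin_to_index.setdefault(checkin, len(checkin_to_index))
index_to_checkin = {j: c for c, j in checkin_to_index.items()}

# Document-term matrix: X[i, j] counts check-in j for user i.
X = np.zeros((len(user_to_index), len(checkin_to_index)), dtype=np.int64)
for user, checkins in toy_data.items():
    for checkin in checkins:
        X[user_to_index[user], checkin_to_index[checkin]] += 1
print(X)
```

Each row sums to the number of check-ins of that user, which is the same sanity check applied to the real matrix further down.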

<h2> Potential issues </h2>

Here we adopt the notation: `user` - `document`, `cultural pattern` - `topic`, `checkin` - `word`. 
1. The issue that concerns me the most is the low frequency of all words. Within a document, it is unlikely (though still possible) for the same word to occur twice. Across the corpus of 1083 documents, a word appears only about 2.8 times on average (227428 check-ins over 81320 distinct check-ins). This is not a very good ratio. 
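
The sparsity can be quantified directly from the check-in data. The sketch below computes the mean number of occurrences per distinct check-in and the share of check-ins that occur only once; the function name `frequency_summary` and the toy data are assumptions made for the example.

```python
from collections import Counter


def frequency_summary(data):
    """Return (mean occurrences per distinct check-in, share seen exactly once)."""
    counts = Counter(c for checkins in data.values() for c in checkins)
    total = sum(counts.values())
    mean = total / len(counts)
    singletons = sum(1 for n in counts.values() if n == 1) / len(counts)
    return mean, singletons


# Toy corpus: 'Bar' appears twice, the other two check-ins once each.
toy = {'a': [('Bar', 'Oct-Fri-22'), ('Gym', 'Oct-Mon-07')],
       'b': [('Bar', 'Oct-Fri-22'), ('Cafe', 'Nov-Tue-09')]}
mean, singletons = frequency_summary(toy)
print(mean, singletons)
```

Calling `frequency_summary(checkin_data)` on the full data would report the same mean that follows from the totals above (227428 / 81320 ≈ 2.8), along with how many words are seen only once.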

In [337]:
import lda
import numpy as np
from collections import defaultdict

<h2> Construct the user_id and checkin mappings </h2>

In [265]:
#checkin_data : {user1 : [(venue_category_1, time_1), (venue_category_2, time_2), (venue_category_3, time_3...)], 
#                user2 : [(venue_category_13, time_13), (...) ...] ...}


# user - index (from 0 to 1082) mapping 
user_index_mapping = dict()

#index (from 0 to 1082) - user mapping
index_user_mapping = dict()

for index, user in enumerate(checkin_data.keys()):
    user_index_mapping[user] = index
    index_user_mapping[index] = user

# checkin - index (from 0 to 81319) mapping
checkin_index_mapping = dict()

# index (from 0 to 81319) - checkin mapping
index_checkin_mapping = dict()

index = 0
for checkins in checkin_data.values():
    for checkin in checkins:
        if checkin not in checkin_index_mapping:
            checkin_index_mapping[checkin] = index
            index_checkin_mapping[index] = checkin
            index += 1

<h2> Main user-checkin matrix </h2>

In [266]:
# construct and initialize the main matrix (users x distinct check-ins) with all elements equal to zero
main_matrix = np.zeros((len(user_index_mapping), len(checkin_index_mapping)), dtype=np.int64)

# fill in the main matrix: if user1 has the check-in (venue_category_1, time_1), the element at
# i = user_index_mapping[user1], j = checkin_index_mapping[(venue_category_1, time_1)] is incremented by 1
for user, checkins in checkin_data.items():
    i = user_index_mapping[user]
    for checkin in checkins:
        j = checkin_index_mapping[checkin]
        main_matrix[i, j] += 1  # a user can have the same (venue category, time) check-in more than once

In [339]:
# check if code is correct
index = 123
print(np.sum(main_matrix[index]))

user = index_user_mapping[index]
print(len(checkin_data[user]))

print("--------------------------------")

l1 = list()
l2 = set() # because there are duplicates
x = 0
for i in main_matrix[index]:
    if i != 0:
        l1.append(x)
    x += 1

for c in checkin_data[user]:
    l2.add(checkin_index_mapping[c])
    
l2 = list(l2)
print(sorted(l1) == sorted(l2))

431
431
--------------------------------
True


<h2> venue_category, month, day and hour mappings </h2>

In [361]:
venue_categories = set()
for user, checkins in checkin_data.items():
    for checkin in checkins:
        venue_categories.add(checkin[0])
venue_index_mapping = dict()
index_venue_mapping = dict()

for i, venue in enumerate(venue_categories):
    venue_index_mapping[venue] = i
    index_venue_mapping[i] = venue

In [362]:
# Note: there are no March check-ins in the data set.
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_index_mapping = dict()
index_month_mapping = dict()

for i, m in enumerate(month):
    month_index_mapping[m] = i
    index_month_mapping[i] = m

In [363]:
day = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_index_mapping = dict()
index_day_mapping = dict()

for i, d in enumerate(day):
    day_index_mapping[d] = i
    index_day_mapping[i] = d

In [365]:
hour = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 
        '20', '21', '22', '23']
hour_index_mapping = dict()
index_hour_mapping = dict()

for i, h in enumerate(hour):
    hour_index_mapping[h] = i
    index_hour_mapping[i] = h

<h2> Pass the matrix to the algorithm </h2>

In [330]:
def TLDA(matrix, num_topic=20, num_iteration=2000):
    model = lda.LDA(n_topics=num_topic, n_iter=num_iteration, random_state=1)
    # fit on the matrix passed in, not on the global main_matrix
    model.fit(matrix)
    return model.doc_topic_, model.topic_word_

user_pattern_matrix, pattern_checkin_matrix = TLDA(main_matrix)

INFO:lda:n_documents: 1083
INFO:lda:vocab_size: 81320
INFO:lda:n_words: 227428
INFO:lda:n_topics: 20
INFO:lda:n_iter: 2000
INFO:lda:<0> log likelihood: -3580934
INFO:lda:<10> log likelihood: -2883598
INFO:lda:<20> log likelihood: -2821941
INFO:lda:<30> log likelihood: -2779449
INFO:lda:<40> log likelihood: -2750321
INFO:lda:<50> log likelihood: -2730357
INFO:lda:<60> log likelihood: -2715271
INFO:lda:<70> log likelihood: -2704131
INFO:lda:<80> log likelihood: -2696292
INFO:lda:<90> log likelihood: -2689748
INFO:lda:<100> log likelihood: -2683267
INFO:lda:<110> log likelihood: -2679024
INFO:lda:<120> log likelihood: -2674649
INFO:lda:<130> log likelihood: -2672772
INFO:lda:<140> log likelihood: -2669673
INFO:lda:<150> log likelihood: -2667386
INFO:lda:<160> log likelihood: -2664986
INFO:lda:<170> log likelihood: -2663795
INFO:lda:<180> log likelihood: -2660811
INFO:lda:<190> log likelihood: -2659788
INFO:lda:<200> log likelihood: -2658512
INFO:lda:<210> log likelihood: -2658159
...

INFO:lda:<1999> log likelihood: -2643577
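
Once the model is fitted, a quick way to inspect a pattern is to list its highest-probability check-ins. The sketch below does this on a toy topic-word matrix; the helper `top_words` and the toy values are illustrative assumptions.

```python
import numpy as np


def top_words(topic_word, index_to_word, n=3):
    """Return the n highest-probability words for each topic."""
    top = {}
    for k, row in enumerate(topic_word):
        best = np.argsort(row)[::-1][:n]  # column indices, largest first
        top[k] = [index_to_word[int(j)] for j in best]
    return top


# Toy 2-topic, 3-word distribution; each row sums to 1.
toy_topic_word = np.array([[0.6, 0.3, 0.1],
                           [0.1, 0.2, 0.7]])
toy_index_to_word = {0: ('Bar', 'Oct-Fri-22'),
                     1: ('Gym', 'Oct-Mon-07'),
                     2: ('Cafe', 'Nov-Tue-09')}
top = top_words(toy_topic_word, toy_index_to_word, n=2)
print(top)
```

With the notebook's variables, the equivalent call would be `top_words(pattern_checkin_matrix, index_checkin_mapping)`.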


<h2> pattern-venue distribution & pattern-venue matrix </h2>

In [394]:
# format : {pattern_1 : {venue_category_1 : a%, venue_category_2 : b%...}, pattern_2 : {...}, ...}
pattern_venue_distribution = defaultdict(lambda: defaultdict(float))

num_patterns, num_checkins = pattern_checkin_matrix.shape
pattern_venue_matrix = np.zeros((num_patterns, len(venue_index_mapping)))

# sum the probabilities of all check-ins that share a venue category
for i in range(num_patterns):
    for j in range(num_checkins):
        pattern_venue_distribution[i][index_checkin_mapping[j][0]] += pattern_checkin_matrix[i, j]

# loop variable renamed so the `venues` set from data extraction is not overwritten
for i, category_probs in pattern_venue_distribution.items():
    for venue, percentage in category_probs.items():
        j = venue_index_mapping[venue]
        pattern_venue_matrix[i, j] = percentage
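
The same aggregation can be written as one matrix product, which avoids the Python double loop over patterns and check-ins. This is a sketch on toy data; `aggregate_by_group` and `word_to_group` are hypothetical names, where `word_to_group[j]` is the group (venue category or hour) index of check-in `j`.

```python
import numpy as np


def aggregate_by_group(topic_word, word_to_group, n_groups):
    """Sum topic-word probabilities over all words belonging to the same group."""
    n_words = topic_word.shape[1]
    # membership[j, g] is 1 when word j belongs to group g, else 0
    membership = np.zeros((n_words, n_groups))
    membership[np.arange(n_words), word_to_group] = 1.0
    return topic_word @ membership


# Toy example: words 0 and 2 share group 0, word 1 is in group 1.
toy_topic_word = np.array([[0.5, 0.3, 0.2],
                           [0.1, 0.1, 0.8]])
toy_groups = aggregate_by_group(toy_topic_word, [0, 1, 0], n_groups=2)
print(toy_groups)
```

For the venue case, `word_to_group` could be built as `[venue_index_mapping[index_checkin_mapping[j][0]] for j in range(pattern_checkin_matrix.shape[1])]`. Each row of the result still sums to 1 because every check-in belongs to exactly one group.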

<h2> pattern-hour distribution & pattern-hour matrix </h2>

In [417]:
# format : {pattern_1 : {hour_1 : a%, hour_2 : b%...}, pattern_2 : {...}, ...}
pattern_hour_distribution = defaultdict(lambda: defaultdict(float))

num_patterns, num_checkins = pattern_checkin_matrix.shape
pattern_hour_matrix = np.zeros((num_patterns, len(hour_index_mapping)))

# sum the probabilities of all check-ins that share an hour of day
for i in range(num_patterns):
    for j in range(num_checkins):
        pattern_hour_distribution[i][index_checkin_mapping[j][1].split('-')[2]] += pattern_checkin_matrix[i, j]

# loop variables renamed so the `hour` list defined earlier is not overwritten
for i, hour_probs in pattern_hour_distribution.items():
    for h, percentage in hour_probs.items():
        j = hour_index_mapping[h]
        pattern_hour_matrix[i, j] = percentage

<h2> Evaluation of TLDA</h2>

__Inputs:__
1. top venue categories `V*` for each pattern.
2. top time periods (hours) `T*` for each pattern.
3. all the check-in activities `SW`.
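
The first two inputs can be read off the pattern-venue and pattern-hour matrices by taking the largest entries of each row. This is a minimal sketch on a toy matrix; the helper `top_indices` and the cut-off `n` are assumptions, since the number of top items is not fixed in the text above.

```python
import numpy as np


def top_indices(matrix, n):
    """For each row (pattern), return the column indices of the n largest values."""
    return np.argsort(matrix, axis=1)[:, ::-1][:, :n]


# Toy pattern-hour matrix with 2 patterns and 3 hours.
toy_pattern_hour = np.array([[0.1, 0.5, 0.4],
                             [0.7, 0.2, 0.1]])
T_star = top_indices(toy_pattern_hour, n=2)
print(T_star)
```

The returned indices can be mapped back to labels with `index_hour_mapping` or `index_venue_mapping`.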

__High level:__

1. We first define a segmentation `S_{one set}` for each __top venue category__ `v*` in each pattern:
<img src="one_set.png">

We use `S` to denote the set of all segmentations `S_{one set}`, and `|S|` = `Q`.
2. For each segmentation `S_{one set}`, we calculate the normalised pointwise mutual information (NPMI) for the `v*-T*` vector and the `V*-T*` vector, respectively:
<img src="little_w.png">

where `P (v*, t*_j )` is the probability of the co-occurrence of `v*` and `t*_j`. 

3. After calculating the NPMI value for each venue category, we aggregate them to obtain the *j*-th element of the time vector of `V*` by the following equation:
<img src="big_w.png">

where `v*_i` represents the *i*-th venue category in `V*`.

4. Cosine similarity is then calculated between each pair of context vectors `w_q` and `W_q`, from which the final score is obtained:
<img src="mq.png">
<img src="m.png">
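
The two building blocks of the steps above, NPMI and cosine similarity, can be sketched as follows. The exact formulas are given in the images and are not reproduced here; this sketch uses the standard textbook definition NPMI = log(P(x, y) / (P(x) P(y))) / (-log P(x, y)), which is an assumption and may differ from the variant in the images (for example in smoothing). The function names are illustrative.

```python
import math


def npmi(p_xy, p_x, p_y):
    """Standard NPMI in [-1, 1]; boundary cases follow the usual limits."""
    if p_xy == 0:
        return -1.0  # the two events never co-occur
    if p_xy == 1:
        return 1.0   # the two events always co-occur
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


independent = npmi(0.25, 0.5, 0.5)   # co-occurrence equals product of marginals
always_together = npmi(0.5, 0.5, 0.5)
print(independent, always_together, cosine([1, 0], [1, 0]))
```

In the evaluation, the probabilities would be estimated from co-occurrence counts of a venue category and an hour over all check-ins `SW`, and `cosine` would compare the vectors `w_q` and `W_q`.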

__Pseudocode:__
<img src="pseudo.png">

In [None]:
# pattern_venue_matrix 
# pattern_hour_matrix
V_star = defaultdict(list)
