<center><h1>TLDA</h1></center>

<img src="tlda.png">

1. `theta_u` : pattern distribution of user *u*.
2. `phi_t` : pattern distribution of time *t*.
3. `psi_z` : venue category distribution of pattern *z*

Analogous to the traditional LDA : `document` - `user/time`, `topic` - `cultural pattern`, and `word` - `venue`.

<h2>Data Extraction</h2>

1. `checkin_data` : {*user1* : [(*venue_category_1*, *time_1*), (*venue_category_2*, *time_2*),..., (*venue_category_n*, *time_n*)],*user2* : [(*venue_category_1*, *time_1*), (*venue_category_2*, *time_2*) ...], *user3*...}.
2. `venues` : list of venue ids.
3. `venue_categories` : list of venue categories.


In [7]:
from collections import defaultdict
import numpy as np

In [8]:
# Month(str): Jan Feb [Mar not included] Apr May Jun Jul Aug Sep Oct Nov Dec
# day: Mon, Tue, Wed, Thu, Fri, Sat, Sun
# Hour(int): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23


# data format: {user1 : [(venue_category_1, time_1), (venue_category_2, time_2), (venue_category_3, time_3)]}
# time format: month-day-hour; Example: (Oct-Wed-13)

# our main data
checkin_data = defaultdict(list)

# store all venues (not venue categories)
venues = set()

# The file is read with an explicit encoding; the decoding problem does not occur on Windows
nyc = open("nyc.txt", encoding='ISO-8859-1')

# data extraction
for checkin in nyc:
    checkin_time = (checkin.split('\t')[-1]).split(' ')
    
    # time in the correct format
    time = checkin_time[1] + '-' + checkin_time[0] + '-' + checkin_time[3].split(':')[0]
    
    # corresponding venue category, combine categories into a single category
    category = checkin.split('\t')[3]
    if 'Restaurant' in category:
        venue_category = 'Restaurant'
    elif 'Joint' in category:
        venue_category = 'Food Joint'
    elif 'Museum' in category:
        venue_category = 'Museum'
    else:
        venue_category = category
    
    # one checkin tuple of this user
    single_checkin = (venue_category, time)

    #user id
    user = checkin.split('\t')[0]
    
    checkin_data[user].append(single_checkin)


In [9]:
# format : {venue_category_1 : count 1, venue_category_2 : count 2, ...}
venue_count = defaultdict(int)
for user, checkins in checkin_data.items():
    for checkin in checkins:
        venue_count[checkin[0]] += 1

In [10]:
# delete all checkins whose venue category has been visited fewer than 1200 times
for user, checkins in checkin_data.items():
    new_checkins = []
    for checkin in checkins:
        if venue_count[checkin[0]] >= 1200:
            new_checkins.append(checkin)
    checkin_data[user] = new_checkins

In [11]:
# remove users who have fewer than 100 remaining check-ins
remove = []
for user, checkins in checkin_data.items():
    if len(checkins) < 100:
        remove.append(user)
for r in remove:
    del checkin_data[r]
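
The two filtering steps above can also be written as one small function. This is only a sketch on toy data: the function name `filter_checkins` and its two threshold parameters are made up for illustration (the notebook uses 1200 and 100), and the function returns a new dictionary instead of modifying `checkin_data` in place.

```python
from collections import Counter

def filter_checkins(checkin_data, min_category_count, min_user_checkins):
    """Drop rare venue categories first, then drop users left with too few check-ins."""
    # count how often each venue category occurs over all users
    category_count = Counter(
        checkin[0] for checkins in checkin_data.values() for checkin in checkins
    )
    # keep only check-ins whose category is frequent enough
    kept = {
        user: [c for c in checkins if category_count[c[0]] >= min_category_count]
        for user, checkins in checkin_data.items()
    }
    # keep only users that still have enough check-ins
    return {user: cs for user, cs in kept.items() if len(cs) >= min_user_checkins}

toy = {
    "a": [("Bar", "Oct-Fri-22"), ("Bar", "Oct-Sat-23"), ("Zoo", "Jun-Sun-10")],
    "b": [("Bar", "Apr-Mon-18")],
}
filtered = filter_checkins(toy, min_category_count=2, min_user_checkins=2)
print(filtered)
```

The order matters: a user can fall below the per-user threshold only after the rare categories have been removed, which is why the category filter runs first.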

In [12]:
print("Number of users: {}".format(len(checkin_data)))
venue_categories = set()

for user, checkins in checkin_data.items():
    for checkin in checkins:
        venue_categories.add(checkin[0])
print("Number of venue categories: {}".format(len(venue_categories)))

total_checkins = 0
for user, checkins in checkin_data.items():
    total_checkins += len(checkins)
    
print("Number of checkins: {}".format(total_checkins))

# use the built-in max/min instead of shadowing them with variables
checkins_per_user = [len(checkins) for checkins in checkin_data.values()]
print("Maximum number of check-ins per user: {}".format(max(checkins_per_user)))
print("Minimum number of check-ins per user: {}".format(min(checkins_per_user)))

distinct_checkins = set()
for checkins in checkin_data.values():
    for checkin in checkins:
        distinct_checkins.add(checkin)
print("Number of distinct check-ins: {}".format(len(distinct_checkins)))

# store categories into a file (the with-block closes the file)
with open('venue_categories.txt', 'w') as f:
    for v in venue_categories:
        f.write(v + '\n')

Number of users: 782
Number of venue categories: 40
Number of checkins: 158746
Maximum number of check-ins per user: 2063
Minimum number of check-ins per user: 100
Number of distinct check-ins: 39936


In [13]:
number_of_users = len(checkin_data)

number_of_venue_categories = len(venue_categories)

number_of_distinct_checkins = len(distinct_checkins)

number_of_checkins = total_checkins

<h2> Basic info - Before </h2>

1. `1083` users, with id from `1` to `1083`. 
2. `38333` venues.
3. `251` venue categories.
4. `227428` checkins
5. Maximum number of check-ins per user: `2697`
6. Minimum number of check-ins per user: `100`
7. Number of distinct check-ins: `81320`


<h2> Basic info - After </h2>

1. `782` users
2. `38333` venues.
3. `40` venue categories.
4. `158746` checkins
5. Maximum number of check-ins per user: `2063`
6. Minimum number of check-ins per user: `100`
7. Number of distinct check-ins: `39936`


<h2>What the author did</h2>

1. First, filter cultural fans by keeping only users with at least 20 check-ins.
2. Besides a venue category label, also attach a temporal label to each cultural check-in with three levels of identifiers: month of year (Oct), day of week (Fri), and hour of day (13). In this form, a user's check-in history can be represented as `(User3, ((Concert hall, JulFri20), (Golf, OctSun10), (Yoga, AprFri18)))`, for example.
3. Run the TLDA model with the optimum number of patterns *K* given by TCV. The seven values from 3 to 9 are used as candidates; TLDA is run for 100 iterations on each, the average TCV score of each candidate is computed, and the *K* with the highest score is selected.
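
As a small illustration of the temporal label in point 2, the sketch below turns one raw timestamp into the `month-day-hour` string used in this notebook. The raw timestamp layout (`Tue Apr 03 18:00:09 +0000 2012`) is an assumption inferred from how the extraction cell splits the last column of `nyc.txt`; the helper name `to_time_label` is made up for this example.

```python
def to_time_label(timestamp):
    """Convert 'Tue Apr 03 18:00:09 +0000 2012' into 'Apr-Tue-18'."""
    parts = timestamp.strip().split(' ')
    day, month = parts[0], parts[1]      # day of week, month of year
    hour = parts[3].split(':')[0]        # two-digit hour of day
    return '-'.join((month, day, hour))

label = to_time_label("Tue Apr 03 18:00:09 +0000 2012")
print(label)
```

The hour stays a zero-padded string, which matches the keys of `hour_index_mapping` defined later in the notebook.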

<h2> What we did with respect to points above</h2>

1. For our data set, the minimum number of check-ins per user is 100 (the maximum is 2063), so there is no need to trim the data.
2. We followed the author's approach. However, based on the heatmap, the cultural pattern with respect to day of week is not significant, so we might change the time format to `(month-hour)` instead of `(month-day-hour)`.

<h2> My approach </h2>

The input to the `lda` library has to be a document-term matrix `X`, where `X_{ij}` is the number of times the term at index `j` appears in document `i`. <br>
There are 782 users, in other words 782 "documents", and 39936 distinct check-ins, in other words 39936 "words". So we construct a 782 by 39936 matrix as the input. <br>
Because the users in `checkin_data` are not ordered by their ids, we have to create a user_id-index mapping. Similarly, we also create a checkin-index mapping. <br>
After running LDA, we need to map indices back to users and check-ins. Therefore, we also need an index-user_id mapping and an index-checkin mapping, which are exactly the reverse of the two data structures above.
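
A toy version of this construction, with two users and two distinct check-ins, may make the layout easier to see. The data and the short names (`user_to_row`, `vocab`, `X`) are invented for this sketch; the notebook itself uses `user_index_mapping`, `checkin_index_mapping` and `main_matrix`.

```python
import numpy as np

toy_checkins = {
    "u7": [("Bar", "Oct-Fri-22"), ("Bar", "Oct-Fri-22"), ("Park", "Jun-Sun-10")],
    "u3": [("Park", "Jun-Sun-10")],
}

# forward mappings: user id -> row index, check-in -> column index
user_to_row = {user: i for i, user in enumerate(toy_checkins)}
vocab = {}
for checkins in toy_checkins.values():
    for checkin in checkins:
        vocab.setdefault(checkin, len(vocab))

# reverse mappings, needed to interpret the LDA output
row_to_user = {i: user for user, i in user_to_row.items()}
col_to_checkin = {j: checkin for checkin, j in vocab.items()}

# document-term matrix: X[i, j] counts check-in j for user i
X = np.zeros((len(user_to_row), len(vocab)), dtype=int)
for user, checkins in toy_checkins.items():
    for checkin in checkins:
        X[user_to_row[user], vocab[checkin]] += 1

print(X)
```

Each row sums to the number of check-ins of that user, which is the same sanity check the notebook performs on the real matrix.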

<h2> Potential issues </h2>

Here we adopt the notation: `user` - `document`, `cultural pattern` - `topic`, `checkin` - `word`. 
1. The issue that concerns me the most is the low frequency of all words. Within a document, it is unlikely (though still possible) for the same word to occur twice. Across the corpus of 782 documents, a word appears about four times on average (158746 check-ins over 39936 distinct check-ins). This is not a very good ratio. 
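
The ratio can be checked directly from the statistics printed after filtering (782 users, 39936 distinct check-ins, 158746 check-ins). The second number below is an upper bound, since a repeated check-in adds to an existing cell rather than filling a new one; the variable names are chosen only for this sketch.

```python
n_documents = 782
n_distinct_checkins = 39936   # vocabulary size
n_checkins = 158746           # total number of word occurrences

# average number of times a word occurs in the whole corpus
avg_occurrences_per_word = n_checkins / n_distinct_checkins

# at most n_checkins cells of the document-term matrix can be non-zero
max_fill_ratio = n_checkins / (n_documents * n_distinct_checkins)

print("Average occurrences per word: {:.2f}".format(avg_occurrences_per_word))
print("Upper bound on the fraction of non-zero cells: {:.4f}".format(max_fill_ratio))
```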

In [14]:
import lda
import numpy as np
import math
from collections import defaultdict
from scipy import spatial

<h2> Construct the user_id and checkin mappings </h2>

In [15]:
#checkin_data : {user1 : [(venue_category_1, time_1), (venue_category_2, time_2), (venue_category_3, time_3...)], 
#                user2 : [(venue_category_13, time_13), (...) ...] ...}


user_index_mapping = dict()

index_user_mapping = dict()

for index, user in enumerate(checkin_data.keys()):
    user_index_mapping[user] = index
    index_user_mapping[index] = user

    
checkin_index_mapping = dict()

index_checkin_mapping = dict()

index = 0
for checkins in checkin_data.values():
    for checkin in checkins:
        if checkin not in checkin_index_mapping:
            checkin_index_mapping[checkin] = index
            index_checkin_mapping[index] = checkin
            index += 1

<h2> Main user-checkin matrix </h2>

In [16]:
# construct the main matrix, initialized with zeros
main_matrix = np.zeros((number_of_users, number_of_distinct_checkins), dtype=int)

# fill in the main matrix: if user1 has a check-in (venue_category_1, time_1), the element at
# i = user_index_mapping[user1], j = checkin_index_mapping[(venue_category_1, time_1)] is incremented by 1
for user, checkins in checkin_data.items():
    i = user_index_mapping[user]
    for checkin in checkins:
        j = checkin_index_mapping[checkin]
        main_matrix[i, j] += 1  # the same (venue category, time) pair can occur more than once

In [17]:
# check if code is correct
index = 123
print(np.sum(main_matrix[index]))

user = index_user_mapping[index]
print(len(checkin_data[user]))

print("--------------------------------")

l1 = list()
l2 = set() # because there are duplicates
x = 0
for i in main_matrix[index]:
    if i != 0:
        l1.append(x)
    x += 1

for c in checkin_data[user]:
    l2.add(checkin_index_mapping[c])
    
l2 = list(l2)
print(sorted(l1) == sorted(l2))

161
161
--------------------------------
True


<h2> venue_category, month, day and hour mappings </h2>

In [18]:
venue_categories = set()
for user, checkins in checkin_data.items():
    for checkin in checkins:
        venue_categories.add(checkin[0])
        
venue_index_mapping = dict()
index_venue_mapping = dict()

for i, venue in enumerate(venue_categories):
    venue_index_mapping[venue] = i
    index_venue_mapping[i] = venue

In [19]:
# Note: there is no March in the data set; 'Mar' is kept so that month indices follow the calendar
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_index_mapping = dict()
index_month_mapping = dict()

for i, m in enumerate(month):
    month_index_mapping[m] = i
    index_month_mapping[i] = m

In [20]:
day = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_index_mapping = dict()
index_day_mapping = dict()

for i, d in enumerate(day):
    day_index_mapping[d] = i
    index_day_mapping[i] = d

In [21]:
hour = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 
        '20', '21', '22', '23']
hour_index_mapping = dict()
index_hour_mapping = dict()

for i, h in enumerate(hour):
    hour_index_mapping[h] = i
    index_hour_mapping[i] = h

<h2> Pass the matrix to the algorithm </h2>

In [22]:
def TLDA(matrix, num_topic = 9, num_iteration = 2000):
    model = lda.LDA(n_topics=num_topic, n_iter=num_iteration, random_state=1)
    # fit on the matrix passed in, not on the global main_matrix
    model.fit(matrix)
    return model.doc_topic_, model.topic_word_

user_pattern_matrix, pattern_checkin_matrix = TLDA(main_matrix)

INFO:lda:n_documents: 782
INFO:lda:vocab_size: 39936
INFO:lda:n_words: 158746
INFO:lda:n_topics: 9
INFO:lda:n_iter: 2000
INFO:lda:<0> log likelihood: -2257865
INFO:lda:<10> log likelihood: -1816302
INFO:lda:<20> log likelihood: -1794237
INFO:lda:<30> log likelihood: -1778022
INFO:lda:<40> log likelihood: -1762681
INFO:lda:<50> log likelihood: -1749884
INFO:lda:<60> log likelihood: -1741483
INFO:lda:<70> log likelihood: -1733950
INFO:lda:<80> log likelihood: -1729304
INFO:lda:<90> log likelihood: -1724057
INFO:lda:<100> log likelihood: -1720995
INFO:lda:<110> log likelihood: -1717791
INFO:lda:<120> log likelihood: -1715682
INFO:lda:<130> log likelihood: -1713186
INFO:lda:<140> log likelihood: -1710515
INFO:lda:<150> log likelihood: -1709891
INFO:lda:<160> log likelihood: -1708914
INFO:lda:<170> log likelihood: -1708448
INFO:lda:<180> log likelihood: -1706322
INFO:lda:<190> log likelihood: -1705238
INFO:lda:<200> log likelihood: -1704822
INFO:lda:<210> log likelihood: -1703721
INFO:lda:<

INFO:lda:<1999> log likelihood: -1691976


<h2> pattern-venue distribution & pattern-venue matrix </h2>

In [23]:
# format : {pattern_1 : {venue_category_1 : a%, venue_category_2 : b%...}, pattern_2 : {...}, ...}
pattern_venue_distribution = defaultdict(lambda : defaultdict(lambda:0))

pattern_venue_matrix = np.zeros((pattern_checkin_matrix.shape[0], number_of_venue_categories))

for i in range(pattern_checkin_matrix.shape[0]):
    for j in range(pattern_checkin_matrix.shape[1]):
        pattern_venue_distribution[i][index_checkin_mapping[j][0]] += pattern_checkin_matrix[i, j]

for i, venues in pattern_venue_distribution.items():
    for venue, percentage in venues.items():
        j = venue_index_mapping[venue]
        pattern_venue_matrix[i, j] = percentage

# the pattern #0
for v, p in pattern_venue_distribution[0].items():
    if p > 0.1:
        print('{} : {}'.format(v, p))

Other Great Outdoors : 0.10083183640204099
Neighborhood : 0.11102760174974402


In [44]:
pv = defaultdict(list)

for pattern, venue_cates in pattern_venue_distribution.items():
    for venue_cate, percentage in venue_cates.items():
        if percentage > 1/40:
            pv[pattern].append(venue_cate)


defaultdict(<class 'list'>, {0: ['Bus Station', 'Other Great Outdoors', 'Road', 'Food & Drink Shop', 'Neighborhood', 'Deli / Bodega', 'Park', 'Building', 'Residential Building (Apartment / Condo)', 'Government Building', 'Drugstore / Pharmacy', 'Medical Center'], 1: ['Home (private)', 'Food & Drink Shop', 'Neighborhood', 'Coffee Shop', 'College Academic Building', 'Medical Center'], 2: ['Restaurant', 'Food Joint', 'Bar', 'Food & Drink Shop', 'Park', 'Clothing Store', 'Coffee Shop'], 3: ['Home (private)', 'Bus Station', 'Road', 'Building'], 4: ['Restaurant', 'Bar', 'Food & Drink Shop', 'Airport', 'Office', 'Hotel', 'Coffee Shop'], 5: ['Train Station', 'Office', 'Coffee Shop'], 6: ['Train Station', 'Bus Station', 'Subway', 'Neighborhood', 'Deli / Bodega'], 7: ['Restaurant', 'Gym / Fitness Center', 'Food & Drink Shop', 'Airport', 'Hotel', 'Coffee Shop'], 8: ['Restaurant', 'Train Station', 'Salon / Barbershop', 'Bus Station', 'Road', 'Food & Drink Shop', 'Neighborhood', 'Deli / Bodega', 'P

In [55]:
import optics
# The encoding problem does not exist in windows OS
nyc = open("nyc.txt", encoding='ISO-8859-1')

#format : {pattern 1 : [point1, point2, point3, ...,], pattern2 : [...], ...}
pattern_point = defaultdict(list)


# data extraction
for checkin in nyc:
    category = checkin.split('\t')[3]
    if 'Restaurant' in category:
        venue_category = 'Restaurant'
    elif 'Joint' in category:
        venue_category = 'Food Joint'
    elif 'Museum' in category:
        venue_category = 'Museum'
    else:
        venue_category = category
    
    lat = float(checkin.split('\t')[4])
    long = float(checkin.split('\t')[5])
    venue = optics.Point(lat, long)
    
    for p, cate in pv.items():
        if venue_category in cate:
            pattern_point[p].append(venue)
    

In [56]:
import pickle
pickle.dump(pattern_point, open("pattern_point.pickle", "wb") )

[(40.745164, -73.982519), (40.690427, -73.954687), (40.779422, -73.955341), (40.719762, -74.250014), (40.826790, -73.949509), (40.883020, -74.075875), (40.790599, -73.980234), (40.752307, -73.971854), (40.741862, -73.989434), (40.742188, -73.987924), (40.901058, -74.150763), (40.924312, -73.996888), (40.786713, -74.175476), (40.786906, -74.175405), (40.885440, -74.138870), (40.817220, -73.947902), (40.739711, -73.982518), (40.751101, -73.981301), (40.804640, -73.937862), (40.701587, -73.957453), (40.911923, -73.782377), (40.759513, -73.831472), (40.831622, -74.136794), (40.870630, -74.097926), (40.808700, -73.958515), (40.677559, -73.744525), (40.679121, -73.749920), (40.680064, -73.761619), (40.773522, -73.956785), (40.748281, -73.985563), (40.940400, -73.962149), (40.844306, -74.043963), (40.782804, -73.951759), (40.730836, -73.997641), (40.730769, -73.997450), (40.702480, -73.799812), (40.774111, -73.959546), (40.736527, -73.990560), (40.712590, -74.006320), (40.896213, -73.876705),

<h2> pattern-hour distribution & pattern-hour matrix </h2>

In [51]:
# format : {pattern_1 : {hour_1 : a%, hour_2 : b%...}, pattern_2 : {...}, ...}
pattern_hour_distribution = defaultdict(lambda : defaultdict(lambda:0))

pattern_hour_matrix = np.zeros((pattern_checkin_matrix.shape[0], 24))

for i in range(pattern_checkin_matrix.shape[0]):
    for j in range(pattern_checkin_matrix.shape[1]):
        pattern_hour_distribution[i][index_checkin_mapping[j][1].split('-')[2]] += pattern_checkin_matrix[i, j]

for i, hours in pattern_hour_distribution.items():
    # use h, not hour, so the global list `hour` is not overwritten
    for h, percentage in hours.items():
        j = hour_index_mapping[h]
        pattern_hour_matrix[i, j] = percentage

<h2> Evaluation of TLDA</h2>

__Inputs:__
1. top venue categories `V*` for each pattern.
2. top time periods (hours) `T*` for each pattern.
3. all the check-in activities `SW`.

__High level:__

1. We first define a segmentation `S_{one set}` for each __top venue category__ `v*` in each pattern:
<img src="one_set.png">

We use `S` to denote the set of all segmentations `S_{one set}`, with `|S|` = `Q`.
2. For each segmentation `S_{one set}`, we calculate the normalised pointwise mutual information (NPMI) between `v*` and each top hour in `T*` (the `v*-T*` vector), and between every venue category in `V*` and each top hour (the `V*-T*` vector): 
<img src="little_w.png">

where `P (v*, t*_j )` is the probability of the co-occurrence of `v*` and `t*_j`. 

3. After calculating the NPMI value for each venue category, we aggregate them to obtain the jth element of the time vector of `V∗` by the following equation:
<img src="big_w.png">

where `v*_i` represents the ith venue category in `V*`.

4. The cosine similarity between each pair of context vectors `w_q` and `W_q` is then calculated, and the per-segmentation scores are averaged to obtain the final score:
<img src="mq.png">
<img src="m.png">
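
For readers without the images, the quantities can be transcribed from the `NPMI` function and the evaluation loop of this notebook. This is a reading of the code, with `ε` the smoothing constant and `τ` the exponent (0.001 and 2 by default in the code); the notation in the original figures may differ.

$$
\mathrm{NPMI}(v, t) = \left( \frac{\log_2 \dfrac{P(v, t) + \epsilon}{P(v)\,P(t)}}{-\log_2\left(P(v, t) + \epsilon\right)} \right)^{\tau}
$$

$$
w_{q,j} = \mathrm{NPMI}(v^*, t^*_j), \qquad
W_{q,j} = \sum_{i} \mathrm{NPMI}(v^*_i, t^*_j)
$$

$$
m_q = \cos(w_q, W_q) = \frac{w_q \cdot W_q}{\lVert w_q \rVert \, \lVert W_q \rVert}, \qquad
m = \frac{1}{Q} \sum_{q=1}^{Q} m_q
$$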

__Pseudocode:__
<img src="pseudo.png">

<h2> Construct the V* and T* for each pattern </h2>

In [52]:
# format: {pattern_1 : [index_1, index_2, index_3, index_4, index_5], pattern_2 : [], ...}
V_star = defaultdict(list)
for i in range(pattern_venue_matrix.shape[0]):
    top_five_indices = pattern_venue_matrix[i].argsort()[-5:][::-1]
    V_star[i] = top_five_indices

T_star = defaultdict(list)
for i in range(pattern_hour_matrix.shape[0]):
    top_five_indices = pattern_hour_matrix[i].argsort()[-5:][::-1]
    T_star[i] = top_five_indices
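
The `argsort()[-5:][::-1]` idiom used above returns the indices of the largest values in descending order of value. A toy check with the top three of seven made-up scores:

```python
import numpy as np

scores = np.array([0.1, 0.5, 0.2, 0.9, 0.05, 0.3, 0.7])
k = 3

# argsort sorts ascending, so the last k indices are the k largest values;
# [::-1] reverses them so the largest comes first
top_k = scores.argsort()[-k:][::-1]
print(top_k)  # indices 3, 6, 1 (values 0.9, 0.7, 0.5)
```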


<h2> Base-2 logarithm function </h2>

In [53]:
def log_2(n):
    return math.log2(n)

<h2> Cosine similarity function </h2>

In [54]:
def cosine_sim(l1, l2):
    return (1 - spatial.distance.cosine(l1, l2))
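
For reference, the same quantity can be computed from its definition with NumPy alone. This is a sketch, not a replacement for the SciPy-based `cosine_sim` above; the name `cosine_sim_np` is made up, and a zero vector would lead to a division by zero here.

```python
import numpy as np

def cosine_sim_np(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

parallel = cosine_sim_np([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_sim_np([1, 0], [0, 1])       # perpendicular
print(parallel, orthogonal)
```

Parallel vectors score 1 regardless of their length, which matters for the evaluation: `W` is a sum over five venue categories and is therefore larger in magnitude than `w`, but only the direction is compared.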

<h2> NPMI Function </h2>

In [55]:
#checkin_data : {user1 : [(venue_category_1, time_1), (venue_category_2, time_2), (venue_category_3, time_3...)], 
#                user2 : [(venue_category_13, time_13), (...) ...] ...}

# time format: month-day-hour; Example: (Oct-Wed-13)

# index_venue_mapping, index_hour_mapping 


def NPMI(v : "index of venue category v", t : "index of hour t", checkin_data, epsilon = 0.001, tau = 2):
    v_sum = 0
    t_sum = 0
    v_t_sum = 0

    for u, checkins in checkin_data.items():
        for checkin in checkins:
            if checkin[0] == index_venue_mapping[v]:
                v_sum += 1
            if checkin[1].split('-')[2] == index_hour_mapping[t]:
                t_sum += 1
            if checkin[0] == index_venue_mapping[v] and checkin[1].split('-')[2] == index_hour_mapping[t]:
                v_t_sum += 1

    p_v = v_sum / number_of_checkins
    p_t = t_sum / number_of_checkins
    p_v_t = v_t_sum / number_of_checkins

    upper = (p_v_t + epsilon)/(p_v * p_t)
    lower = p_v_t + epsilon

    numerator = log_2(upper)
    denominator = -log_2(lower)

    result = (numerator / denominator)**tau
    
    return result
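
The function above rescans every check-in for each `(v, t)` pair. One possible alternative, sketched here under the same formula, is to count everything once and look the counts up afterwards. The names `build_counts` and `npmi_from_counts` are hypothetical, and unlike `NPMI` this version takes the venue category name and hour string directly instead of their indices.

```python
import math
from collections import Counter

def build_counts(checkin_data):
    """Single pass over the data: venue, hour and (venue, hour) counts."""
    venue_counts, hour_counts, pair_counts = Counter(), Counter(), Counter()
    total = 0
    for checkins in checkin_data.values():
        for venue, time in checkins:
            hour = time.split('-')[2]          # time format: month-day-hour
            venue_counts[venue] += 1
            hour_counts[hour] += 1
            pair_counts[(venue, hour)] += 1
            total += 1
    return venue_counts, hour_counts, pair_counts, total

def npmi_from_counts(venue, hour, counts, epsilon=0.001, tau=2):
    """Same formula as NPMI above, evaluated from precomputed counts."""
    venue_counts, hour_counts, pair_counts, total = counts
    p_v = venue_counts[venue] / total
    p_t = hour_counts[hour] / total
    p_v_t = pair_counts[(venue, hour)] / total
    numerator = math.log2((p_v_t + epsilon) / (p_v * p_t))
    denominator = -math.log2(p_v_t + epsilon)
    return (numerator / denominator) ** tau

toy = {
    "u1": [("Bar", "Oct-Fri-22"), ("Bar", "Oct-Sat-22"), ("Park", "Oct-Sun-10")],
    "u2": [("Park", "Apr-Mon-10")],
}
counts = build_counts(toy)
score = npmi_from_counts("Bar", "22", counts)
print(score)
```

As in the original function, a venue category or hour that never occurs gives a zero probability and a division by zero, so the inputs are assumed to come from the data itself.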

In [56]:
S = set()
num_of_patterns = pattern_venue_matrix.shape[0]

for i in range(num_of_patterns):
    for v in V_star[i]:
        S_oneset = (v, tuple(V_star[i]), tuple(T_star[i]))
        S.add(S_oneset)

m = []

# S_oneset format: (v*, V*, T*)
for S_oneset in S:
    v_star, V, T = S_oneset
    w = [0.0] * len(T)
    W = [0.0] * len(T)
    for j, t in enumerate(T):
        w[j] = NPMI(v_star, t, checkin_data)
        # aggregate the NPMI of every venue category in V* for hour t
        W[j] = sum(NPMI(v, t, checkin_data) for v in V)
    m.append(cosine_sim(w, W))

final_result = np.sum(m) / len(S)
print(final_result)

0.8833109243423494
